Re: Not Serializable exception when integrating SQL and Spark Streaming
Generally you can use |-Dsun.io.serialization.extendedDebugInfo=true| to enable serialization debugging information when serialization exceptions are raised. On 12/24/14 1:32 PM, bigdata4u wrote: I am trying to use SQL over Spark Streaming using Java, but I am getting a serialization exception.

public static void main(String args[]) {
    SparkConf sparkConf = new SparkConf().setAppName("NumberCount");
    JavaSparkContext jc = new JavaSparkContext(sparkConf);
    JavaStreamingContext jssc = new JavaStreamingContext(jc, new Duration(2000));
    jssc.addStreamingListener(new WorkCountMonitor());
    int numThreads = Integer.parseInt(args[3]);
    Map<String,Integer> topicMap = new HashMap<String,Integer>();
    String[] topics = args[2].split(",");
    for (String topic : topics) {
        topicMap.put(topic, numThreads);
    }
    JavaPairReceiverInputDStream<String,String> data = KafkaUtils.createStream(jssc, args[0], args[1], topicMap);
    data.print();
    JavaDStream<Person> streamData = data.map(new Function<Tuple2<String,String>, Person>() {
        public Person call(Tuple2<String,String> v1) throws Exception {
            String[] stringArray = v1._2.split(",");
            Person Person = new Person();
            Person.setName(stringArray[0]);
            Person.setAge(stringArray[1]);
            return Person;
        }
    });
    final JavaSQLContext sqlContext = new JavaSQLContext(jc);
    streamData.foreachRDD(new Function<JavaRDD<Person>,Void>() {
        public Void call(JavaRDD<Person> rdd) {
            JavaSchemaRDD subscriberSchema = sqlContext.applySchema(rdd, Person.class);
            subscriberSchema.registerAsTable("people");
            System.out.println("all data");
            JavaSchemaRDD names = sqlContext.sql("SELECT name FROM people");
            System.out.println("afterwards");
            List<String> males = new ArrayList<String>();
            males = names.map(new Function<Row,String>() {
                public String call(Row row) {
                    return row.getString(0);
                }
            }).collect();
            System.out.println("before for");
            for (String name : males) {
                System.out.println(name);
            }
            return null;
        }
    });
    jssc.start();
    jssc.awaitTermination();

Exception is:

14/12/23 23:49:38 ERROR JobScheduler: Error running job streaming job 1419378578000 ms.1
org.apache.spark.SparkException: Task not serializable
    at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:166)
    at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158)
    at org.apache.spark.SparkContext.clean(SparkContext.scala:1435)
    at org.apache.spark.rdd.RDD.map(RDD.scala:271)
    at org.apache.spark.api.java.JavaRDDLike$class.map(JavaRDDLike.scala:78)
    at org.apache.spark.sql.api.java.JavaSchemaRDD.map(JavaSchemaRDD.scala:42)
    at com.basic.spark.NumberCount$2.call(NumberCount.java:79)
    at com.basic.spark.NumberCount$2.call(NumberCount.java:67)
    at org.apache.spark.streaming.api.java.JavaDStreamLike$anonfun$foreachRDD$1.apply(JavaDStreamLike.scala:274)
    at org.apache.spark.streaming.api.java.JavaDStreamLike$anonfun$foreachRDD$1.apply(JavaDStreamLike.scala:274)
    at org.apache.spark.streaming.dstream.DStream$anonfun$foreachRDD$1.apply(DStream.scala:529)
    at org.apache.spark.streaming.dstream.DStream$anonfun$foreachRDD$1.apply(DStream.scala:529)
    at org.apache.spark.streaming.dstream.ForEachDStream$anonfun$1.apply$mcV$sp(ForEachDStream.scala:42)
    at org.apache.spark.streaming.dstream.ForEachDStream$anonfun$1.apply(ForEachDStream.scala:40)
    at org.apache.spark.streaming.dstream.ForEachDStream$anonfun$1.apply(ForEachDStream.scala:40)
    at scala.util.Try$.apply(Try.scala:161)
    at org.apache.spark.streaming.scheduler.Job.run(Job.scala:32)
    at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:171)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:724)
Caused by: java.io.NotSerializableException: org.apache.spark.sql.api.java.JavaSQLContext
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1181)
    at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1541)
    at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1506)
    at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1429)
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1175)
    at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1541)
    at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1506)
    at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1429)
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1175)
    at
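For reference, the flag Cheng mentions is worth passing to both the driver and the executors, since closures can be serialized on either side. A minimal spark-submit sketch (the jar name and program arguments are placeholders, not taken from the original post):

spark-submit \
  --class com.basic.spark.NumberCount \
  --driver-java-options "-Dsun.io.serialization.extendedDebugInfo=true" \
  --conf spark.executor.extraJavaOptions=-Dsun.io.serialization.extendedDebugInfo=true \
  numbercount.jar <zkQuorum> <group> <topics> <numThreads>

With that in place, the NotSerializableException is followed by the chain of object references that dragged the offending object (here the JavaSQLContext) into the serialized closure.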
Re: SparkSQL: CREATE EXTERNAL TABLE with a SchemaRDD
Hao and Lam - I think the issue here is that |registerRDDAsTable| only creates a temporary table, which is not seen by the Hive metastore. Michael once gave a workaround for creating an external Parquet table: http://apache-spark-user-list.1001560.n3.nabble.com/persist-table-schema-in-spark-sql-td16297.html Cheng On 12/24/14 9:38 AM, Cheng, Hao wrote: Hi, Lam, I can confirm this is a bug with the latest master, and I filed a jira issue for it: https://issues.apache.org/jira/browse/SPARK-4944 Hope to come up with a solution soon. Cheng Hao *From:* Jerry Lam [mailto:chiling...@gmail.com] *Sent:* Wednesday, December 24, 2014 4:26 AM *To:* user@spark.apache.org *Subject:* SparkSQL: CREATE EXTERNAL TABLE with a SchemaRDD Hi spark users, I'm trying to create an external table using HiveContext after creating a SchemaRDD and saving the RDD into a Parquet file on HDFS. I would like to use the schema in the SchemaRDD (rdd_table) when I create the external table. For example:

rdd_table.saveAsParquetFile("/user/spark/my_data.parquet")
hiveContext.registerRDDAsTable(rdd_table, "rdd_table")
hiveContext.sql("CREATE EXTERNAL TABLE my_data LIKE rdd_table LOCATION '/user/spark/my_data.parquet'")

The last line fails with:

org.apache.spark.sql.execution.QueryExecutionException: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. Table not found rdd_table
    at org.apache.spark.sql.hive.HiveContext.runHive(HiveContext.scala:322)
    at org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:284)
    at org.apache.spark.sql.hive.execution.NativeCommand.sideEffectResult$lzycompute(NativeCommand.scala:35)
    at org.apache.spark.sql.hive.execution.NativeCommand.sideEffectResult(NativeCommand.scala:35)
    at org.apache.spark.sql.hive.execution.NativeCommand.execute(NativeCommand.scala:38)
    at org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd$lzycompute(HiveContext.scala:382)
    at org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd(HiveContext.scala:382)

Is this supported? Best Regards, Jerry
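For readers hitting the same wall before SPARK-4944 is resolved: one workaround along the lines of the post Cheng links is to write the Parquet file and then issue the CREATE EXTERNAL TABLE with the columns spelled out explicitly, instead of LIKE-ing the temporary table. A rough sketch (the column names/types are invented for illustration, and STORED AS PARQUET assumes Hive 0.13+; older Hive needs the explicit Parquet SerDe and input/output formats):

rdd_table.saveAsParquetFile("/user/spark/my_data.parquet")
hiveContext.sql(
  "CREATE EXTERNAL TABLE my_data (name STRING, age INT) " +
  "STORED AS PARQUET " +
  "LOCATION '/user/spark/my_data.parquet'")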
Re: Spark Installation Maven PermGen OutOfMemoryException
Java 8 rpm 64bit downloaded from official oracle site solved my problem. And I need not set max heap size, final memory shown at the end of maven build was 81/1943M. I want to learn spark so have no restriction on choosing java version. Guru Medasani, thanks for the tip. I will repeat info, that I wrongly send only to Sean. I have tried export MAVEN_OPTS=`-Xmx=3g -XX:MaxPermSize=1g -XX:ReservedCodeCacheSize=1g` and it doesn't work also. Best Regards, Vladimir Protsenko 2014-12-23 19:45 GMT+04:00 Guru Medasani gdm...@outlook.com: Thanks for the clarification Sean. Best Regards, Guru Medasani From: so...@cloudera.com Date: Tue, 23 Dec 2014 15:39:59 + Subject: Re: Spark Installation Maven PermGen OutOfMemoryException To: gdm...@outlook.com CC: protsenk...@gmail.com; user@spark.apache.org The text there is actually unclear. In Java 8, you still need to set the max heap size (-Xmx2g). The optional bit is the -XX:MaxPermSize=512M actually. Java 8 no longer has a separate permanent generation. On Tue, Dec 23, 2014 at 3:32 PM, Guru Medasani gdm...@outlook.com wrote: Hi Vladimir, From the link Sean posted, if you use Java 8 there is this following note. Note: For Java 8 and above this step is not required. So if you have no problems using Java 8, give it a shot. Best Regards, Guru Medasani - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Spark Installation Maven PermGen OutOfMemoryException
That command is still wrong. It is -Xmx3g with no =. On Dec 24, 2014 9:50 AM, Vladimir Protsenko protsenk...@gmail.com wrote: Java 8 rpm 64bit downloaded from official oracle site solved my problem. And I need not set max heap size, final memory shown at the end of maven build was 81/1943M. I want to learn spark so have no restriction on choosing java version. Guru Medasani, thanks for the tip. I will repeat info, that I wrongly send only to Sean. I have tried export MAVEN_OPTS=`-Xmx=3g -XX:MaxPermSize=1g -XX:ReservedCodeCacheSize=1g` and it doesn't work also. Best Regards, Vladimir Protsenko 2014-12-23 19:45 GMT+04:00 Guru Medasani gdm...@outlook.com: Thanks for the clarification Sean. Best Regards, Guru Medasani From: so...@cloudera.com Date: Tue, 23 Dec 2014 15:39:59 + Subject: Re: Spark Installation Maven PermGen OutOfMemoryException To: gdm...@outlook.com CC: protsenk...@gmail.com; user@spark.apache.org The text there is actually unclear. In Java 8, you still need to set the max heap size (-Xmx2g). The optional bit is the -XX:MaxPermSize=512M actually. Java 8 no longer has a separate permanent generation. On Tue, Dec 23, 2014 at 3:32 PM, Guru Medasani gdm...@outlook.com wrote: Hi Vladimir, From the link Sean posted, if you use Java 8 there is this following note. Note: For Java 8 and above this step is not required. So if you have no problems using Java 8, give it a shot. Best Regards, Guru Medasani - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
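For anyone copying that line later: besides dropping the '=', note that the backticks in the original would be interpreted as command substitution by the shell; plain double quotes are what is wanted. Something like the following, where the sizes are just the values the Spark build docs suggest:

export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m"
mvn -DskipTests clean package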
got "org.apache.thrift.protocol.TProtocolException: Expected protocol id ffffff82 but got ffffff80" from the Hive metastore service when I use the show tables command in the spark-sql shell
this is my problem. I use mysql to store hive meta data. and i can get what i want when I exec show tables in hive shell. but in the same machine. I use spark-sql to execute same command (show tables), I got errors. I look at the log of hive metastore find this errors 2014-12-24 05:04:59,874 ERROR [pool-3-thread-2]: server.TThreadPoolServer (TThreadPoolServer.java:run(294)) - Thrift error occurred during processing of message. org.apache.thrift.protocol.TProtocolException: Expected protocol id ff82 but got ff80 at org.apache.thrift.protocol.TCompactProtocol.readMessageBegin(TCompactProtocol.java:503) at org.apache.hadoop.hive.metastore.TUGIBasedProcessor.process(TUGIBasedProcessor.java:75) at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:285) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) 2014-12-24 05:05:01,883 ERROR [pool-3-thread-3]: server.TThreadPoolServer (TThreadPoolServer.java:run(294)) - Thrift error occurred during processing of message. org.apache.thrift.protocol.TProtocolException: Expected protocol id ff82 but got ff80 at org.apache.thrift.protocol.TCompactProtocol.readMessageBegin(TCompactProtocol.java:503) at org.apache.hadoop.hive.metastore.TUGIBasedProcessor.process(TUGIBasedProcessor.java:75) at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:285) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) I compiled e spark from soucecode of spark-1.1.1 and my hive was compiled from apache-hive-0.15.0 I put the same hive-site.xml both hive conf and apark conf. waiting for help thanks
Re: How to build Spark against the latest
Hi Ted, The referenced command works, but where can I get the deployable binaries? Xiaobo Gu -- Original -- From: Ted Yu; yuzhih...@gmail.com; Send time: Wednesday, Dec 24, 2014 12:09 PM To: guxiaobo1...@qq.com; Cc: user@spark.apache.org; Subject: Re: How to build Spark against the latest See http://search-hadoop.com/m/JW1q5Cew0j On Tue, Dec 23, 2014 at 8:00 PM, guxiaobo1982 guxiaobo1...@qq.com wrote: Hi, The official pom.xml file only has a profile for Hadoop version 2.4 as the latest version, but I installed Hadoop version 2.6.0 with Ambari. How can I build Spark against it - just using mvn -Dhadoop.version=2.6.0, or how do I make a corresponding profile for it? Regards, Xiaobo
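On the deployable-binaries question: the Spark source tree includes make-distribution.sh, which runs the Maven build and lays out (optionally tars up) a runnable distribution, so the same Hadoop overrides can be passed straight through. A sketch, assuming YARN is wanted and the hadoop-2.4 profile is the right base for a 2.6.0 build:

./make-distribution.sh --tgz -Pyarn -Phadoop-2.4 -Dhadoop.version=2.6.0

The resulting spark-*-bin-*.tgz is what gets copied to the cluster.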
Need help for Spark-JobServer setup on Maven (for Java programming)
Dear All, We are trying to share RDDs across different sessions of same Web application (Java). We need to share single RDD between those sessions. As we understand from some posts, it is possible through Spark-JobServer. Is there any guidelines you can provide to setup Spark-JobServer for Maven (for Apache Spark - Java programming). We will be glad for your suggestion. Sasi -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Need-help-for-Spark-JobServer-setup-on-Maven-for-Java-programming-tp20849.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Why does consuming a RESTful web service (using javax.ws.rs.* and Jersey) work in a unit test but not when submitted to Spark?
Hello, I have a piece of code that runs inside Spark Streaming and tries to get some data from a RESTful web service (that runs locally on my machine). The code snippet in question is: Client client = ClientBuilder.newClient(); WebTarget target = client.target(http://localhost:/rest;); target = target.path(annotate) .queryParam(text, UrlEscapers.urlFragmentEscaper().escape(spotlightSubmission)) .queryParam(confidence, 0.3); logger.warn(!!! DEBUG !!! target: {}, target.getUri().toString()); String response = target.request().accept(MediaType.APPLICATION_JSON_TYPE).get(String.class); logger.warn(!!! DEBUG !!! Spotlight response: {}, response); When run inside a unit test as follows: mvn clean test -Dtest=SpotlightTest#testCountWords it contacts the RESTful web service and retrieves some data as expected. But when the same code is run as part of the application that is submitted to Spark, using spark-submit script I receive the following error: java.lang.NoSuchMethodError: javax.ws.rs.core.MultivaluedMap.addAll(Ljava/lang/Object;[Ljava/lang/Object;)V I'm using Spark 1.1.0 and for consuming the web service I'm using Jersey in my project's pom.xml: dependency groupIdorg.glassfish.jersey.containers/groupId artifactIdjersey-container-servlet-core/artifactId version2.14/version /dependency So I suspect that when the application is submitted to Spark, somehow there's a different JAR in the environment that uses a different version of Jersey / javax.ws.rs.* Does anybody know which version of Jersey / javax.ws.rs.* is used in the Spark environment, or how to solve this conflict? -- Emre Sevinç https://be.linkedin.com/in/emresevinc/
Re: got "org.apache.thrift.protocol.TProtocolException: Expected protocol id ffffff82 but got ffffff80" from the Hive metastore service when I use the show tables command in the spark-sql shell
Hi Roc, Spark SQL 1.2.0 can only work with Hive 0.12.0 or Hive 0.13.1 (controlled by compilation flags), versions prior 1.2.0 only works with Hive 0.12.0. So Hive 0.15.0-SNAPSHOT is not an option. Would like to add that this is due to backwards compatibility issue of Hive metastore, AFAIK there's no popular system works with multiple versions of Hive metastore simultaneously. Cheng On 12/24/14 6:31 PM, Roc Chu wrote: this is my problem. I use mysql to store hive meta data. and i can get what i want when I exec show tables in hive shell. but in the same machine. I use spark-sql to execute same command (show tables), I got errors. I look at the log of hive metastore find this errors 2014-12-24 05:04:59,874 ERROR [pool-3-thread-2]: server.TThreadPoolServer (TThreadPoolServer.java:run(294)) - Thrift error occurred during processing of message. org.apache.thrift.protocol.TProtocolException: Expected protocol id ff82 but got ff80 at org.apache.thrift.protocol.TCompactProtocol.readMessageBegin(TCompactProtocol.java:503) at org.apache.hadoop.hive.metastore.TUGIBasedProcessor.process(TUGIBasedProcessor.java:75) at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:285) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) 2014-12-24 05:05:01,883 ERROR [pool-3-thread-3]: server.TThreadPoolServer (TThreadPoolServer.java:run(294)) - Thrift error occurred during processing of message. org.apache.thrift.protocol.TProtocolException: Expected protocol id ff82 but got ff80 at org.apache.thrift.protocol.TCompactProtocol.readMessageBegin(TCompactProtocol.java:503) at org.apache.hadoop.hive.metastore.TUGIBasedProcessor.process(TUGIBasedProcessor.java:75) at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:285) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) I compiled e spark from soucecode of spark-1.1.1 and my hive was compiled from apache-hive-0.15.0 I put the same hive-site.xml both hive conf and apark conf. waiting for help thanks - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Spark Installation Maven PermGen OutOfMemoryException
Thanks. Bad mistake. 2014-12-24 14:02 GMT+04:00 Sean Owen so...@cloudera.com: That command is still wrong. It is -Xmx3g with no =. On Dec 24, 2014 9:50 AM, Vladimir Protsenko protsenk...@gmail.com wrote: Java 8 rpm 64bit downloaded from official oracle site solved my problem. And I need not set max heap size, final memory shown at the end of maven build was 81/1943M. I want to learn spark so have no restriction on choosing java version. Guru Medasani, thanks for the tip. I will repeat info, that I wrongly send only to Sean. I have tried export MAVEN_OPTS=`-Xmx=3g -XX:MaxPermSize=1g -XX:ReservedCodeCacheSize=1g` and it doesn't work also. Best Regards, Vladimir Protsenko 2014-12-23 19:45 GMT+04:00 Guru Medasani gdm...@outlook.com: Thanks for the clarification Sean. Best Regards, Guru Medasani From: so...@cloudera.com Date: Tue, 23 Dec 2014 15:39:59 + Subject: Re: Spark Installation Maven PermGen OutOfMemoryException To: gdm...@outlook.com CC: protsenk...@gmail.com; user@spark.apache.org The text there is actually unclear. In Java 8, you still need to set the max heap size (-Xmx2g). The optional bit is the -XX:MaxPermSize=512M actually. Java 8 no longer has a separate permanent generation. On Tue, Dec 23, 2014 at 3:32 PM, Guru Medasani gdm...@outlook.com wrote: Hi Vladimir, From the link Sean posted, if you use Java 8 there is this following note. Note: For Java 8 and above this step is not required. So if you have no problems using Java 8, give it a shot. Best Regards, Guru Medasani - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Why does consuming a RESTful web service (using javax.ws.rs.* and Jersey) work in a unit test but not when submitted to Spark?
Your guess is right, that there are two incompatible versions of Jersey (or really, JAX-RS) in your runtime. Spark doesn't use Jersey, but its transitive dependencies may, or your transitive dependencies may. I don't see Jersey in Spark's dependency tree except from HBase tests, which in turn only appear in examples, so that's unlikely to be it. I'd take a look with 'mvn dependency:tree' on your own code first. Maybe you are including JavaEE 6 for example? On Wed, Dec 24, 2014 at 12:02 PM, Emre Sevinc emre.sev...@gmail.com wrote: Hello, I have a piece of code that runs inside Spark Streaming and tries to get some data from a RESTful web service (that runs locally on my machine). The code snippet in question is: Client client = ClientBuilder.newClient(); WebTarget target = client.target(http://localhost:/rest;); target = target.path(annotate) .queryParam(text, UrlEscapers.urlFragmentEscaper().escape(spotlightSubmission)) .queryParam(confidence, 0.3); logger.warn(!!! DEBUG !!! target: {}, target.getUri().toString()); String response = target.request().accept(MediaType.APPLICATION_JSON_TYPE).get(String.class); logger.warn(!!! DEBUG !!! Spotlight response: {}, response); When run inside a unit test as follows: mvn clean test -Dtest=SpotlightTest#testCountWords it contacts the RESTful web service and retrieves some data as expected. But when the same code is run as part of the application that is submitted to Spark, using spark-submit script I receive the following error: java.lang.NoSuchMethodError: javax.ws.rs.core.MultivaluedMap.addAll(Ljava/lang/Object;[Ljava/lang/Object;)V I'm using Spark 1.1.0 and for consuming the web service I'm using Jersey in my project's pom.xml: dependency groupIdorg.glassfish.jersey.containers/groupId artifactIdjersey-container-servlet-core/artifactId version2.14/version /dependency So I suspect that when the application is submitted to Spark, somehow there's a different JAR in the environment that uses a different version of Jersey / javax.ws.rs.* Does anybody know which version of Jersey / javax.ws.rs.* is used in the Spark environment, or how to solve this conflict? -- Emre Sevinç https://be.linkedin.com/in/emresevinc/ - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
saveAsNewAPIHadoopDataset against hbase hanging in pyspark 1.2.0
Hi, I have been using this without any issues with Spark 1.1.0, but after upgrading to 1.2.0, saving an RDD from pyspark using saveAsNewAPIHadoopDataset into HBase just hangs - even when testing with the example from the stock hbase_outputformat.py. Is anyone having the same issue? (and able to solve it?) Using HBase 0.98.6 and yarn-client mode. thanks, Antony.
Re: Why does consuming a RESTful web service (using javax.ws.rs.* and Jersey) work in a unit test but not when submitted to Spark?
On Wed, Dec 24, 2014 at 1:46 PM, Sean Owen so...@cloudera.com wrote: I'd take a look with 'mvn dependency:tree' on your own code first. Maybe you are including JavaEE 6 for example? For reference, my complete pom.xml looks like: project xmlns=http://maven.apache.org/POM/4.0.0; xmlns:xsi= http://www.w3.org/2001/XMLSchema-instance; xsi:schemaLocation=http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd; modelVersion4.0.0/modelVersion groupIdbigcontent/groupId artifactIdbigcontent/artifactId version1.0-SNAPSHOT/version packagingjar/packaging namebigcontent/name urlhttp://maven.apache.org/url properties project.build.sourceEncodingUTF-8/project.build.sourceEncoding /properties build plugins plugin groupIdorg.apache.maven.plugins/groupId artifactIdmaven-shade-plugin/artifactId version2.3/version configuration !-- put your configurations here -- /configuration executions execution phasepackage/phase goals goalshade/goal /goals /execution /executions /plugin plugin groupIdorg.apache.maven.plugins/groupId artifactIdmaven-compiler-plugin/artifactId version3.2/version configuration source1.7/source target1.7/target /configuration /plugin /plugins /build dependencies dependency groupIdorg.apache.spark/groupId artifactIdspark-streaming_2.10/artifactId version1.1.1/version scopeprovided/scope /dependency dependency groupIdorg.glassfish.jersey.containers/groupId artifactIdjersey-container-servlet-core/artifactId version2.14/version /dependency dependency groupIdorg.apache.hadoop/groupId artifactIdhadoop-client/artifactId version2.4.0/version /dependency dependency groupIdcom.google.guava/groupId artifactIdguava/artifactId version16.0/version /dependency dependency groupIdorg.apache.hadoop/groupId artifactIdhadoop-mapreduce-client-core/artifactId version2.4.0/version /dependency dependency groupIdjson-mapreduce/groupId artifactIdjson-mapreduce/artifactId version1.0-SNAPSHOT/version exclusions exclusion groupIdjavax.servlet/groupId artifactId*/artifactId /exclusion exclusion groupIdcommons-io/groupId artifactId*/artifactId /exclusion exclusion groupIdcommons-lang/groupId artifactId*/artifactId /exclusion exclusion groupIdorg.apache.hadoop/groupId artifactIdhadoop-common/artifactId /exclusion /exclusions /dependency dependency groupIdorg.apache.avro/groupId artifactIdavro-mapred/artifactId version1.7.7/version exclusions exclusion groupIdjavax.servlet/groupId artifactId*/artifactId /exclusion exclusion groupIdorg.apache.hadoop/groupId artifactIdhadoop-common/artifactId /exclusion /exclusions /dependency dependency groupIdjunit/groupId artifactIdjunit/artifactId version4.11/version scopetest/scope /dependency dependency groupIdorg.apache.avro/groupId artifactIdavro/artifactId version1.7.7/version exclusions exclusion groupIdjavax.servlet/groupId artifactId*/artifactId /exclusion /exclusions /dependency dependency groupIdorg.apache.hadoop/groupId artifactIdhadoop-common/artifactId version2.4.0/version scopeprovided/scope exclusions exclusion groupIdjavax.servlet/groupId artifactId*/artifactId /exclusion exclusion groupIdcom.google.guava/groupId artifactId*/artifactId /exclusion /exclusions /dependency dependency groupIdorg.slf4j/groupId artifactIdslf4j-log4j12/artifactId version1.7.7/version /dependency /dependencies /project And 'mvn dependency:tree' produces the following output: [INFO] Scanning for projects... 
[INFO] [INFO] [INFO] Building bigcontent 1.0-SNAPSHOT [INFO] [INFO] [INFO] --- maven-dependency-plugin:2.8:tree (default-cli) @ bigcontent --- [INFO] bigcontent:bigcontent:jar:1.0-SNAPSHOT [INFO] +- org.apache.spark:spark-streaming_2.10:jar:1.1.1:provided [INFO] | +- org.apache.spark:spark-core_2.10:jar:1.1.1:provided [INFO] | | +- org.apache.curator:curator-recipes:jar:2.4.0:provided [INFO] | | | \- org.apache.curator:curator-framework:jar:2.4.0:provided [INFO] | | | \-
Re: Why does consuming a RESTful web service (using javax.ws.rs.* and Jersey) work in a unit test but not when submitted to Spark?
It seems like YARN depends an older version of Jersey, that is 1.9: https://github.com/apache/spark/blob/master/yarn/pom.xml When I've modified my dependencies to have only: dependency groupIdcom.sun.jersey/groupId artifactIdjersey-core/artifactId version1.9.1/version /dependency And then modified the code to use the old Jersey API: Client c = Client.create(); WebResource r = c.resource(http://localhost:/rest;) .path(annotate) .queryParam(text, UrlEscapers.urlFragmentEscaper().escape(spotlightSubmission)) .queryParam(confidence, 0.3); logger.warn(!!! DEBUG !!! target: {}, r.getURI()); String response = r.accept(MediaType.APPLICATION_JSON_TYPE) //.header() .get(String.class); logger.warn(!!! DEBUG !!! Spotlight response: {}, response); It seems to work when I use spark-submit to submit the application that includes this code. Funny thing is, now my relevant unit test does not run, complaining about not having enough memory: Java HotSpot(TM) 64-Bit Server VM warning: INFO: os::commit_memory(0xc490, 25165824, 0) failed; error='Cannot allocate memory' (errno=12) # # There is insufficient memory for the Java Runtime Environment to continue. # Native memory allocation (mmap) failed to map 25165824 bytes for committing reserved memory. -- Emre On Wed, Dec 24, 2014 at 1:46 PM, Sean Owen so...@cloudera.com wrote: Your guess is right, that there are two incompatible versions of Jersey (or really, JAX-RS) in your runtime. Spark doesn't use Jersey, but its transitive dependencies may, or your transitive dependencies may. I don't see Jersey in Spark's dependency tree except from HBase tests, which in turn only appear in examples, so that's unlikely to be it. I'd take a look with 'mvn dependency:tree' on your own code first. Maybe you are including JavaEE 6 for example? On Wed, Dec 24, 2014 at 12:02 PM, Emre Sevinc emre.sev...@gmail.com wrote: Hello, I have a piece of code that runs inside Spark Streaming and tries to get some data from a RESTful web service (that runs locally on my machine). The code snippet in question is: Client client = ClientBuilder.newClient(); WebTarget target = client.target(http://localhost:/rest;); target = target.path(annotate) .queryParam(text, UrlEscapers.urlFragmentEscaper().escape(spotlightSubmission)) .queryParam(confidence, 0.3); logger.warn(!!! DEBUG !!! target: {}, target.getUri().toString()); String response = target.request().accept(MediaType.APPLICATION_JSON_TYPE).get(String.class); logger.warn(!!! DEBUG !!! Spotlight response: {}, response); When run inside a unit test as follows: mvn clean test -Dtest=SpotlightTest#testCountWords it contacts the RESTful web service and retrieves some data as expected. But when the same code is run as part of the application that is submitted to Spark, using spark-submit script I receive the following error: java.lang.NoSuchMethodError: javax.ws.rs.core.MultivaluedMap.addAll(Ljava/lang/Object;[Ljava/lang/Object;)V I'm using Spark 1.1.0 and for consuming the web service I'm using Jersey in my project's pom.xml: dependency groupIdorg.glassfish.jersey.containers/groupId artifactIdjersey-container-servlet-core/artifactId version2.14/version /dependency So I suspect that when the application is submitted to Spark, somehow there's a different JAR in the environment that uses a different version of Jersey / javax.ws.rs.* Does anybody know which version of Jersey / javax.ws.rs.* is used in the Spark environment, or how to solve this conflict? -- Emre Sevinç https://be.linkedin.com/in/emresevinc/ -- Emre Sevinc
Re: Single worker locked at 100% CPU
Turns out that I was just being idiotic and had assigned so much memory to Spark that the O/S was ending up continually swapping. Apologies for the noise. Phil On Wed, Dec 24, 2014 at 1:16 AM, Andrew Ash and...@andrewash.com wrote: Hi Phil, This sounds a lot like a deadlock in Hadoop's Configuration object that I ran into a while back. If you jstack the JVM and see a thread that looks like the below, it could be https://issues.apache.org/jira/browse/SPARK-2546 Executor task launch worker-6 daemon prio=10 tid=0x7f91f01fe000 nid=0x54b1 runnable [0x7f92d74f1000] java.lang.Thread.State: RUNNABLE at java.util.HashMap.transfer(HashMap.java:601) at java.util.HashMap.resize(HashMap.java:581) at java.util.HashMap.addEntry(HashMap.java:879) at java.util.HashMap.put(HashMap.java:505) at org.apache.hadoop.conf.Configuration.set(Configuration.java:803) at org.apache.hadoop.conf.Configuration.set(Configuration.java:783) at org.apache.hadoop.conf.Configuration.setClass(Configuration.java:1662) The fix for this issue is hidden behind a flag because it might have performance implications, but if it is this problem then you can set spark.hadoop.cloneConf=true and see if that fixes things. Good luck! Andrew On Tue, Dec 23, 2014 at 9:40 AM, Phil Wills otherp...@gmail.com wrote: I've been attempting to run a job based on MLlib's ALS implementation for a while now and have hit an issue I'm having a lot of difficulty getting to the bottom of. On a moderate size set of input data it works fine, but against larger (still well short of what I'd think of as big) sets of data, I'll see one or two workers get stuck spinning at 100% CPU and the job unable to recover. I don't believe this is down to memory pressure as I seem to get the same behaviour at about the same size of input data, even if the cluster is twice as large. GC logs also suggest things are proceeding reasonably with some Full GC's occurring, but no suggestion of the process being GC locked. After rebooting the instance that got into trouble, I can see the stderr log for the task truncated in the middle of a log-line at the time CPU shoots to and sticks at 100%, but no other signs of a problem. I've run into the same issue on 1.1.0 and 1.2.0 in standalone mode and running on YARN. Any suggestions on further steps I could try to get a clearer diagnosis of the issue would be much appreciated. Thanks, Phil
Re: Why does consuming a RESTful web service (using javax.ws.rs.* and Jersey) work in a unit test but not when submitted to Spark?
That could well be it -- oops, I forgot to run with the YARN profile and so didn't see the YARN dependencies. Try the userClassPathFirst option to try to make your app's copy take precedence. The second error is really a JVM bug, but, is from having too little memory available for the unit tests. http://spark.apache.org/docs/latest/building-spark.html#setting-up-mavens-memory-usage On Wed, Dec 24, 2014 at 1:56 PM, Emre Sevinc emre.sev...@gmail.com wrote: It seems like YARN depends an older version of Jersey, that is 1.9: https://github.com/apache/spark/blob/master/yarn/pom.xml When I've modified my dependencies to have only: dependency groupIdcom.sun.jersey/groupId artifactIdjersey-core/artifactId version1.9.1/version /dependency And then modified the code to use the old Jersey API: Client c = Client.create(); WebResource r = c.resource(http://localhost:/rest;) .path(annotate) .queryParam(text, UrlEscapers.urlFragmentEscaper().escape(spotlightSubmission)) .queryParam(confidence, 0.3); logger.warn(!!! DEBUG !!! target: {}, r.getURI()); String response = r.accept(MediaType.APPLICATION_JSON_TYPE) //.header() .get(String.class); logger.warn(!!! DEBUG !!! Spotlight response: {}, response); It seems to work when I use spark-submit to submit the application that includes this code. Funny thing is, now my relevant unit test does not run, complaining about not having enough memory: Java HotSpot(TM) 64-Bit Server VM warning: INFO: os::commit_memory(0xc490, 25165824, 0) failed; error='Cannot allocate memory' (errno=12) # # There is insufficient memory for the Java Runtime Environment to continue. # Native memory allocation (mmap) failed to map 25165824 bytes for committing reserved memory. -- Emre On Wed, Dec 24, 2014 at 1:46 PM, Sean Owen so...@cloudera.com wrote: Your guess is right, that there are two incompatible versions of Jersey (or really, JAX-RS) in your runtime. Spark doesn't use Jersey, but its transitive dependencies may, or your transitive dependencies may. I don't see Jersey in Spark's dependency tree except from HBase tests, which in turn only appear in examples, so that's unlikely to be it. I'd take a look with 'mvn dependency:tree' on your own code first. Maybe you are including JavaEE 6 for example? On Wed, Dec 24, 2014 at 12:02 PM, Emre Sevinc emre.sev...@gmail.com wrote: Hello, I have a piece of code that runs inside Spark Streaming and tries to get some data from a RESTful web service (that runs locally on my machine). The code snippet in question is: Client client = ClientBuilder.newClient(); WebTarget target = client.target(http://localhost:/rest;); target = target.path(annotate) .queryParam(text, UrlEscapers.urlFragmentEscaper().escape(spotlightSubmission)) .queryParam(confidence, 0.3); logger.warn(!!! DEBUG !!! target: {}, target.getUri().toString()); String response = target.request().accept(MediaType.APPLICATION_JSON_TYPE).get(String.class); logger.warn(!!! DEBUG !!! Spotlight response: {}, response); When run inside a unit test as follows: mvn clean test -Dtest=SpotlightTest#testCountWords it contacts the RESTful web service and retrieves some data as expected. 
But when the same code is run as part of the application that is submitted to Spark, using spark-submit script I receive the following error: java.lang.NoSuchMethodError: javax.ws.rs.core.MultivaluedMap.addAll(Ljava/lang/Object;[Ljava/lang/Object;)V I'm using Spark 1.1.0 and for consuming the web service I'm using Jersey in my project's pom.xml: dependency groupIdorg.glassfish.jersey.containers/groupId artifactIdjersey-container-servlet-core/artifactId version2.14/version /dependency So I suspect that when the application is submitted to Spark, somehow there's a different JAR in the environment that uses a different version of Jersey / javax.ws.rs.* Does anybody know which version of Jersey / javax.ws.rs.* is used in the Spark environment, or how to solve this conflict? -- Emre Sevinç https://be.linkedin.com/in/emresevinc/ -- Emre Sevinc - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
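For completeness, the 'userClassPathFirst' setting Sean refers to is experimental and its exact name has moved around between releases (spark.files.userClassPathFirst for executors in the 1.1/1.2 line, a YARN-specific spark.yarn.user.classpath.first, and spark.driver/spark.executor.userClassPathFirst in later releases). A hedged example of trying it at submit time:

spark-submit \
  --conf spark.files.userClassPathFirst=true \
  --conf spark.yarn.user.classpath.first=true \
  ... your-application.jar

If that doesn't help, shading/relocating the JAX-RS packages inside your own assembly jar is the other common way out of this kind of conflict.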
Re: Why does consuming a RESTful web service (using javax.ws.rs.* and Jersey) work in a unit test but not when submitted to Spark?
Sean, Thanks a lot for the important information, especially userClassPathFirst. Cheers, Emre On Wed, Dec 24, 2014 at 3:38 PM, Sean Owen so...@cloudera.com wrote: That could well be it -- oops, I forgot to run with the YARN profile and so didn't see the YARN dependencies. Try the userClassPathFirst option to try to make your app's copy take precedence. The second error is really a JVM bug, but, is from having too little memory available for the unit tests. http://spark.apache.org/docs/latest/building-spark.html#setting-up-mavens-memory-usage On Wed, Dec 24, 2014 at 1:56 PM, Emre Sevinc emre.sev...@gmail.com wrote: It seems like YARN depends an older version of Jersey, that is 1.9: https://github.com/apache/spark/blob/master/yarn/pom.xml When I've modified my dependencies to have only: dependency groupIdcom.sun.jersey/groupId artifactIdjersey-core/artifactId version1.9.1/version /dependency And then modified the code to use the old Jersey API: Client c = Client.create(); WebResource r = c.resource(http://localhost:/rest;) .path(annotate) .queryParam(text, UrlEscapers.urlFragmentEscaper().escape(spotlightSubmission)) .queryParam(confidence, 0.3); logger.warn(!!! DEBUG !!! target: {}, r.getURI()); String response = r.accept(MediaType.APPLICATION_JSON_TYPE) //.header() .get(String.class); logger.warn(!!! DEBUG !!! Spotlight response: {}, response); It seems to work when I use spark-submit to submit the application that includes this code. Funny thing is, now my relevant unit test does not run, complaining about not having enough memory: Java HotSpot(TM) 64-Bit Server VM warning: INFO: os::commit_memory(0xc490, 25165824, 0) failed; error='Cannot allocate memory' (errno=12) # # There is insufficient memory for the Java Runtime Environment to continue. # Native memory allocation (mmap) failed to map 25165824 bytes for committing reserved memory. -- Emre On Wed, Dec 24, 2014 at 1:46 PM, Sean Owen so...@cloudera.com wrote: Your guess is right, that there are two incompatible versions of Jersey (or really, JAX-RS) in your runtime. Spark doesn't use Jersey, but its transitive dependencies may, or your transitive dependencies may. I don't see Jersey in Spark's dependency tree except from HBase tests, which in turn only appear in examples, so that's unlikely to be it. I'd take a look with 'mvn dependency:tree' on your own code first. Maybe you are including JavaEE 6 for example? On Wed, Dec 24, 2014 at 12:02 PM, Emre Sevinc emre.sev...@gmail.com wrote: Hello, I have a piece of code that runs inside Spark Streaming and tries to get some data from a RESTful web service (that runs locally on my machine). The code snippet in question is: Client client = ClientBuilder.newClient(); WebTarget target = client.target(http://localhost:/rest;); target = target.path(annotate) .queryParam(text, UrlEscapers.urlFragmentEscaper().escape(spotlightSubmission)) .queryParam(confidence, 0.3); logger.warn(!!! DEBUG !!! target: {}, target.getUri().toString()); String response = target.request().accept(MediaType.APPLICATION_JSON_TYPE).get(String.class); logger.warn(!!! DEBUG !!! Spotlight response: {}, response); When run inside a unit test as follows: mvn clean test -Dtest=SpotlightTest#testCountWords it contacts the RESTful web service and retrieves some data as expected. 
But when the same code is run as part of the application that is submitted to Spark, using spark-submit script I receive the following error: java.lang.NoSuchMethodError: javax.ws.rs.core.MultivaluedMap.addAll(Ljava/lang/Object;[Ljava/lang/Object;)V I'm using Spark 1.1.0 and for consuming the web service I'm using Jersey in my project's pom.xml: dependency groupIdorg.glassfish.jersey.containers/groupId artifactIdjersey-container-servlet-core/artifactId version2.14/version /dependency So I suspect that when the application is submitted to Spark, somehow there's a different JAR in the environment that uses a different version of Jersey / javax.ws.rs.* Does anybody know which version of Jersey / javax.ws.rs.* is used in the Spark environment, or how to solve this conflict? -- Emre Sevinç https://be.linkedin.com/in/emresevinc/ -- Emre Sevinc -- Emre Sevinc
SVDPlusPlus Recommender in MLLib
hi, Is there any plan to add an SVDPlusPlus-based recommender to MLlib? It is implemented in Mahout, based on this paper: http://research.yahoo.com/files/kdd08koren.pdf Regards, Prafulla.
Re: saveAsNewAPIHadoopDataset against hbase hanging in pyspark 1.2.0
bq. even when testing with the example from the stock hbase_outputformat.py

Can you take a jstack of the above and pastebin it? Thanks

On Wed, Dec 24, 2014 at 4:49 AM, Antony Mayi antonym...@yahoo.com.invalid wrote: Hi, I have been using this without any issues with Spark 1.1.0, but after upgrading to 1.2.0, saving an RDD from pyspark using saveAsNewAPIHadoopDataset into HBase just hangs - even when testing with the example from the stock hbase_outputformat.py. Is anyone having the same issue? (and able to solve it?) Using HBase 0.98.6 and yarn-client mode. thanks, Antony.
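In case it helps with gathering that: the PID of the stuck JVM (the pyspark driver in yarn-client mode, or an executor container on a worker node) can be found with jps, and a thread dump taken with the stock JDK jstack tool, e.g.:

jps -lm
jstack -l <pid> > stack.txt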
null Error in ALS model predict
Hi all! I have an RDD[(Int, Int, Double, Double)] where the first two Int values are id and product, respectively. I trained an implicit ALS algorithm and want to make predictions from this RDD. I tried two approaches, which I thought were equivalent.

1- Convert this RDD to an RDD[(Int, Int)] and use model.predict(RDD[(Int, Int)]); this works for me!
2- Make a map and apply model.predict(int, int), for example:

val ratings = RDD[(int,int,double,double)].map{ case (id, rubro, rating, resp) => model.predict(id, rubro) }

where ratings is an RDD[Double]. Now, with the second way, when I apply ratings.first() I get the following error: Why does this happen? I need to use this second way. Thanks in advance, Franco Barrientos Data Scientist Málaga #115, Of. 1003, Las Condes. Santiago, Chile. (+562)-29699649 (+569)-76347893 mailto:franco.barrien...@exalitica.com franco.barrien...@exalitica.com http://www.exalitica.com/ www.exalitica.com
How to identify erroneous input record ?
hey guys One of my input records has a problem that makes the code fail.

var demoRddFilter = demoRdd.filter(line => !line.contains("ISR$CASE$I_F_COD$FOLL_SEQ") || !line.contains("primaryid$caseid$caseversion"))
var demoRddFilterMap = demoRddFilter.map(line => line.split('$')(0) + "~" + line.split('$')(5) + "~" + line.split('$')(11) + "~" + line.split('$')(12))
demoRddFilterMap.saveAsTextFile("/data/aers/msfx/demo/" + outFile)

This is possibly happening because one input record may not have 13 fields. If this were Hadoop mapper code, I have 2 ways to solve this:
1. test the number of fields of each line before applying the map function
2. enclose the mapping function in a try catch block so that the mapping function only fails for the erroneous record
How do I implement 1. or 2. in the Spark code? Thanks sanjay
Re: How to identify erroneous input record ?
DOH. Looks like I did not have enough coffee before I asked this :-) I added the if statement...

var demoRddFilter = demoRdd.filter(line => !line.contains("ISR$CASE$I_F_COD$FOLL_SEQ") || !line.contains("primaryid$caseid$caseversion"))
var demoRddFilterMap = demoRddFilter.map(line => {
  if (line.split('$').length >= 13) {
    line.split('$')(0) + "~" + line.split('$')(5) + "~" + line.split('$')(11) + "~" + line.split('$')(12)
  }
})

From: Sanjay Subramanian sanjaysubraman...@yahoo.com.INVALID To: user@spark.apache.org Sent: Wednesday, December 24, 2014 8:28 AM Subject: How to identify erroneous input record ? hey guys One of my input records has a problem that makes the code fail.

var demoRddFilter = demoRdd.filter(line => !line.contains("ISR$CASE$I_F_COD$FOLL_SEQ") || !line.contains("primaryid$caseid$caseversion"))
var demoRddFilterMap = demoRddFilter.map(line => line.split('$')(0) + "~" + line.split('$')(5) + "~" + line.split('$')(11) + "~" + line.split('$')(12))
demoRddFilterMap.saveAsTextFile("/data/aers/msfx/demo/" + outFile)

This is possibly happening because one input record may not have 13 fields. If this were Hadoop mapper code, I have 2 ways to solve this:
1. test the number of fields of each line before applying the map function
2. enclose the mapping function in a try catch block so that the mapping function only fails for the erroneous record
How do I implement 1. or 2. in the Spark code? Thanks sanjay
Re: How to identify erroneous input record ?
I don't believe that works since your map function does not return a value for lines shorter than 13 tokens. You should use flatMap and Some/None. (You probably want to not parse the string 5 times too.)

val demoRddFilterMap = demoRddFilter.flatMap { line =>
  val tokens = line.split('$')
  if (tokens.length >= 13) {
    val parsed = tokens(0) + "~" + tokens(5) + "~" + tokens(11) + "~" + tokens(12)
    Some(parsed)
  } else {
    None
  }
}

On Wed, Dec 24, 2014 at 4:35 PM, Sanjay Subramanian sanjaysubraman...@yahoo.com.invalid wrote: DOH. Looks like I did not have enough coffee before I asked this :-) I added the if statement...

var demoRddFilter = demoRdd.filter(line => !line.contains("ISR$CASE$I_F_COD$FOLL_SEQ") || !line.contains("primaryid$caseid$caseversion"))
var demoRddFilterMap = demoRddFilter.map(line => {
  if (line.split('$').length >= 13) {
    line.split('$')(0) + "~" + line.split('$')(5) + "~" + line.split('$')(11) + "~" + line.split('$')(12)
  }
})

From: Sanjay Subramanian sanjaysubraman...@yahoo.com.INVALID To: user@spark.apache.org Sent: Wednesday, December 24, 2014 8:28 AM Subject: How to identify erroneous input record ? hey guys One of my input records has a problem that makes the code fail.

var demoRddFilter = demoRdd.filter(line => !line.contains("ISR$CASE$I_F_COD$FOLL_SEQ") || !line.contains("primaryid$caseid$caseversion"))
var demoRddFilterMap = demoRddFilter.map(line => line.split('$')(0) + "~" + line.split('$')(5) + "~" + line.split('$')(11) + "~" + line.split('$')(12))
demoRddFilterMap.saveAsTextFile("/data/aers/msfx/demo/" + outFile)

This is possibly happening because one input record may not have 13 fields. If this were Hadoop mapper code, I have 2 ways to solve this:
1. test the number of fields of each line before applying the map function
2. enclose the mapping function in a try catch block so that the mapping function only fails for the erroneous record
How do I implement 1. or 2. in the Spark code? Thanks sanjay

- To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: How to identify erroneous input record ?
Although not elegantly I got the output via my code but totally agree on the parsing 5 times (thats really bad).Will add your suggested logic and check it out. I have a long way to the finish line. I am re-architecting my entire hadoop code and getting it onto spark. Check out what I do at www.medicalsidefx.orgPrimarily an iPhone app but underlying is Lucene, Hadoop and hopefully soon in 2015 - Spark :-) From: Sean Owen so...@cloudera.com To: Sanjay Subramanian sanjaysubraman...@yahoo.com Cc: user@spark.apache.org user@spark.apache.org Sent: Wednesday, December 24, 2014 8:56 AM Subject: Re: How to identify erroneous input record ? I don't believe that works since your map function does not return a value for lines shorter than 13 tokens. You should use flatMap and Some/None. (You probably want to not parse the string 5 times too.) val demoRddFilterMap = demoRddFilter.flatMap { line = val tokens = line.split('$') if (tokens.length = 13) { val parsed = tokens(0) + ~ + tokens(5) + ~ + tokens(11) + ~ + tokens(12) Some(parsed) } else { None } } On Wed, Dec 24, 2014 at 4:35 PM, Sanjay Subramanian sanjaysubraman...@yahoo.com.invalid wrote: DOH Looks like I did not have enough coffee before I asked this :-) I added the if statement... var demoRddFilter = demoRdd.filter(line = !line.contains(ISR$CASE$I_F_COD$FOLL_SEQ) || !line.contains(primaryid$caseid$caseversion)) var demoRddFilterMap = demoRddFilter.map(line = { if (line.split('$').length = 13){ line.split('$')(0) + ~ + line.split('$')(5) + ~ + line.split('$')(11) + ~ + line.split('$')(12) } }) From: Sanjay Subramanian sanjaysubraman...@yahoo.com.INVALID To: user@spark.apache.org user@spark.apache.org Sent: Wednesday, December 24, 2014 8:28 AM Subject: How to identify erroneous input record ? hey guys One of my input records has an problem that makes the code fail. var demoRddFilter = demoRdd.filter(line = !line.contains(ISR$CASE$I_F_COD$FOLL_SEQ) || !line.contains(primaryid$caseid$caseversion)) var demoRddFilterMap = demoRddFilter.map(line = line.split('$')(0) + ~ + line.split('$')(5) + ~ + line.split('$')(11) + ~ + line.split('$')(12)) demoRddFilterMap.saveAsTextFile(/data/aers/msfx/demo/ + outFile) This is possibly happening because perhaps one input record may not have 13 fields. If this were Hadoop mapper code , I have 2 ways to solve this 1. test the number of fields of each line before applying the map function 2. enclose the mapping function in a try catch block so that the mapping function only fails for the erroneous record How do I implement 1. or 2. in the Spark code ? Thanks sanjay - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
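For the record, the try/catch route (option 2 of the original question) is also easy to express: wrap the parse in scala.util.Try and flatMap, so any record that fails to parse is simply dropped. A sketch that also follows Sean's advice to split each line only once:

import scala.util.Try

val demoRddFilterMap = demoRddFilter.flatMap { line =>
  Try {
    val tokens = line.split('$')
    tokens(0) + "~" + tokens(5) + "~" + tokens(11) + "~" + tokens(12)
  }.toOption  // a failed parse (e.g. too few fields) becomes None and the record is skipped
}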
hiveContext.jsonFile fails with Unexpected close marker
I have generally been impressed with the way jsonFile eats just about any JSON data model... but I am getting this error when I try to ingest this file: Unexpected close marker ']': expected '} Here are the commands from the pyspark shell:

from pyspark.sql import HiveContext
hiveContext = HiveContext(sc)
f = hiveContext.jsonFile("sample.json")

Here is some sample JSON:

{"wf_session": [
  {"id":"6021fb91-c9ec-4019-9ab9-f628aee8d259","machine_id":"b45c8c4a-7e8e-442d-8d49-fb7c32e2d813","session_id":"d65ca338-c6b8-4bff-93b1-7f2364726fb7","event_at":"2014-12-19T15:55:31.373Z","screen":"x","type":1,"time_left_secs":1},
  {"id":"7e696c19-3ba4-4469-be28-5ef1f0c03d63","machine_id":"b45c8c4a-7e8e-442d-8d49-fb7c32e2d813","session_id":"d65ca338-c6b8-4bff-93b1-7f2364726fb7","event_at":"2014-12-19T15:55:32.385Z","screen":"x","type":2,"ad_unit_id":null,"spot_started_at":"2014-12-19T15:55:12.364Z","spot_ended_at":"2014-12-19T15:55:32.385Z","spot_duration_secs":20,"impression_count":0,"impressions":[],"engagement_count":0,"engagements":[]},
  {"id":"68a43006-09bc-4c18-af55-1ebdc0e041a3","machine_id":"b45c8c4a-7e8e-442d-8d49-fb7c32e2d813","session_id":"d65ca338-c6b8-4bff-93b1-7f2364726fb7","event_at":"2014-12-19T15:55:32.375Z","screen":"x","type":3,"duration_secs":20,"to_ad_unit_id":"developmentbea1f3a4-be08-4119-b9f4-7"}
] }

Any help would be appreciated! :) Merry Xmas!
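A likely culprit: in Spark 1.x, jsonFile expects every line of the input to be one complete, self-contained JSON object (the "JSON lines" convention), so a single document pretty-printed across several lines, like the sample above, is exactly what produces close-marker errors of this kind. Reformatting the file so each record sits on its own line, for example one wf_session event per line, should make it ingestible:

{"id":"6021fb91-c9ec-4019-9ab9-f628aee8d259","machine_id":"b45c8c4a-7e8e-442d-8d49-fb7c32e2d813","session_id":"d65ca338-c6b8-4bff-93b1-7f2364726fb7","event_at":"2014-12-19T15:55:31.373Z","screen":"x","type":1,"time_left_secs":1}
{"id":"7e696c19-3ba4-4469-be28-5ef1f0c03d63","machine_id":"b45c8c4a-7e8e-442d-8d49-fb7c32e2d813","session_id":"d65ca338-c6b8-4bff-93b1-7f2364726fb7","event_at":"2014-12-19T15:55:32.385Z","screen":"x","type":2,"ad_unit_id":null,"spot_started_at":"2014-12-19T15:55:12.364Z","spot_ended_at":"2014-12-19T15:55:32.385Z","spot_duration_secs":20,"impression_count":0,"impressions":[],"engagement_count":0,"engagements":[]}

(If the enclosing wf_session wrapper is needed, the whole object has to be collapsed onto a single line instead.)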
RE: Not Serializable exception when integrating SQL and Spark Streaming
Thanks for the reply. I am testing this with a small amount of data and what is happening is when ever there is data in the Kafka topic Spark does not through Exception otherwise it is. ThanksTarun Date: Wed, 24 Dec 2014 16:23:30 +0800 From: lian.cs@gmail.com To: bigdat...@live.com; user@spark.apache.org Subject: Re: Not Serializable exception when integrating SQL and Spark Streaming Generally you can use -Dsun.io.serialization.extendedDebugInfo=true to enable serialization debugging information when serialization exceptions are raised. On 12/24/14 1:32 PM, bigdata4u wrote: I am trying to use sql over Spark streaming using Java. But i am getting Serialization Exception. public static void main(String args[]) { SparkConf sparkConf = new SparkConf().setAppName(NumberCount); JavaSparkContext jc = new JavaSparkContext(sparkConf); JavaStreamingContext jssc = new JavaStreamingContext(jc, new Duration(2000)); jssc.addStreamingListener(new WorkCountMonitor()); int numThreads = Integer.parseInt(args[3]); MapString,Integer topicMap = new HashMapString,Integer(); String[] topics = args[2].split(,); for (String topic : topics) { topicMap.put(topic, numThreads); } JavaPairReceiverInputDStreamString,String data = KafkaUtils.createStream(jssc, args[0], args[1], topicMap); data.print(); JavaDStreamPerson streamData = data.map(new FunctionTuple2lt;String, String, Person() { public Person call(Tuple2String,String v1) throws Exception { String[] stringArray = v1._2.split(,); Person Person = new Person(); Person.setName(stringArray[0]); Person.setAge(stringArray[1]); return Person; } }); final JavaSQLContext sqlContext = new JavaSQLContext(jc); streamData.foreachRDD(new FunctionJavaRDDlt;Person,Void() { public Void call(JavaRDDPerson rdd) { JavaSchemaRDD subscriberSchema = sqlContext.applySchema(rdd, Person.class); subscriberSchema.registerAsTable(people); System.out.println(all data); JavaSchemaRDD names = sqlContext.sql(SELECT name FROM people); System.out.println(afterwards); ListString males = new ArrayListString(); males = names.map(new FunctionRow,String() { public String call(Row row) { return row.getString(0); } }).collect(); System.out.println(before for); for (String name : males) { System.out.println(name); } return null; } }); jssc.start(); jssc.awaitTermination(); Exception is 14/12/23 23:49:38 ERROR JobScheduler: Error running job streaming job 1419378578000 ms.1 org.apache.spark.SparkException: Task not serializable at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:166) at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158) at org.apache.spark.SparkContext.clean(SparkContext.scala:1435) at org.apache.spark.rdd.RDD.map(RDD.scala:271) at org.apache.spark.api.java.JavaRDDLike$class.map(JavaRDDLike.scala:78) at org.apache.spark.sql.api.java.JavaSchemaRDD.map(JavaSchemaRDD.scala:42) at com.basic.spark.NumberCount$2.call(NumberCount.java:79) at com.basic.spark.NumberCount$2.call(NumberCount.java:67) at org.apache.spark.streaming.api.java.JavaDStreamLike$anonfun$foreachRDD$1.apply(JavaDStreamLike.scala:274) at org.apache.spark.streaming.api.java.JavaDStreamLike$anonfun$foreachRDD$1.apply(JavaDStreamLike.scala:274) at org.apache.spark.streaming.dstream.DStream$anonfun$foreachRDD$1.apply(DStream.scala:529) at org.apache.spark.streaming.dstream.DStream$anonfun$foreachRDD$1.apply(DStream.scala:529) at org.apache.spark.streaming.dstream.ForEachDStream$anonfun$1.apply$mcV$sp(ForEachDStream.scala:42) at 
Re: null Error in ALS model predict
Hi,

The MatrixFactorizationModel consists of two RDDs. When you use the second method, Spark tries to serialize both RDDs for the .map() function, which is not possible, because RDDs are not serializable. Therefore you receive the NullPointerException. You must use the first method.

Best,
Burak

----- Original Message -----
From: Franco Barrientos franco.barrien...@exalitica.com
To: user@spark.apache.org
Sent: Wednesday, December 24, 2014 7:44:24 AM
Subject: null Error in ALS model predict

Hi all! I have an RDD[(Int, Int, Double, Double)] where the first two Int values are id and product, respectively. I trained an implicit ALS algorithm and want to make predictions from this RDD. I try two things, but I think both ways are the same:

1- Convert this RDD to RDD[(Int, Int)] and use model.predict(RDD[(Int, Int)]) - this works for me!
2- Make a map and apply model.predict(int, int), for example:

val ratings = RDD[(Int, Int, Double, Double)].map { case (id, rubro, rating, resp) => model.predict(id, rubro) }

where ratings is an RDD[Double]. Now, with the second way, when I apply ratings.first() I get the following error. Why does this happen? I need to use this second way.

Thanks in advance,
Franco Barrientos
Data Scientist
Málaga #115, Of. 1003, Las Condes. Santiago, Chile.
(+562)-29699649 (+569)-76347893
franco.barrien...@exalitica.com
www.exalitica.com
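A minimal Scala sketch of the first (working) approach, assuming model is the trained MatrixFactorizationModel and data is the RDD[(Int, Int, Double, Double)] from the post:

import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.recommendation.{MatrixFactorizationModel, Rating}

// data: RDD[(Int, Int, Double, Double)] and model: MatrixFactorizationModel
// are assumed to exist already (names taken from the post above).
val userProducts: RDD[(Int, Int)] = data.map { case (id, rubro, _, _) => (id, rubro) }

// predict() on a pair RDD is computed as a distributed join against the model's
// internal factor RDDs, so nothing has to be serialized into a task closure.
val predictions: RDD[Rating] = model.predict(userProducts)
val scores: RDD[Double] = predictions.map(_.rating)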
Re: Spark metrics for ganglia
Did you get past this issue? I'm trying to get this to work as well. You have to compile the spark-ganglia-lgpl artifact into your application:

<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-ganglia-lgpl_2.10</artifactId>
  <version>${project.version}</version>
</dependency>

So I added the above snippet to the examples project, and it finds the class now when I try to run the Pi example, but I get this problem instead:

14/12/24 11:47:23 ERROR metrics.MetricsSystem: Sink class org.apache.spark.metrics.sink.GangliaSink cannot be instantialized
...SNIP...
Caused by: java.lang.NumberFormatException: For input string: 1

On 12/15/14, 11:29 AM, danilopds danilob...@gmail.com wrote:

Thanks tsingfu, I used this configuration based on your post (with ganglia unicast mode):

# Enable GangliaSink for all instances
*.sink.ganglia.class=org.apache.spark.metrics.sink.GangliaSink
*.sink.ganglia.host=10.0.0.7
*.sink.ganglia.port=8649
*.sink.ganglia.period=15
*.sink.ganglia.unit=seconds
*.sink.ganglia.ttl=1
*.sink.ganglia.mode=unicast

Then, I have the following error now:

ERROR metrics.MetricsSystem: Sink class org.apache.spark.metrics.sink.GangliaSink cannot be instantialized
java.lang.ClassNotFoundException: org.apache.spark.metrics.sink.GangliaSink
Discourse: A proposed alternative to the Spark User list
When people have questions about Spark, there are 2 main places (as far as I can tell) where they ask them:

- Stack Overflow, under the apache-spark tag http://stackoverflow.com/questions/tagged/apache-spark
- This mailing list

The mailing list is valuable as an independent place for discussion that is part of the Spark project itself. Furthermore, it allows for a broader range of discussions than would be allowed on Stack Overflow http://stackoverflow.com/help/dont-ask. As the Spark project has grown in popularity, I see that a few problems have emerged with this mailing list:

- It's hard to follow topics (e.g. Streaming vs. SQL) that you're interested in, and it's hard to know when someone has mentioned you specifically.
- It's hard to search for existing threads and link information across disparate threads.
- It's hard to format code and log snippets nicely, and by extension, hard to read other people's posts with this kind of information.

There are existing solutions to all these (and other) problems based around straight-up discipline or client-side tooling, which users have to conjure up for themselves. I'd like us as a community to consider using Discourse http://www.discourse.org/ as an alternative to, or overlay on top of, this mailing list, one that provides better out-of-the-box solutions to these problems. Discourse is a modern discussion platform built by some of the same people who created Stack Overflow. It has many neat features http://v1.discourse.org/about/ that I believe this community would benefit from. For example:

- When a user starts typing up a new post, they get a panel *showing existing conversations that look similar*, just like on Stack Overflow.
- It's easy to search for posts and link between them.
- *Markdown support* is built into the composer.
- You can *specifically mention people* and they will be notified.
- Posts can be categorized (e.g. Streaming, SQL, etc.).
- There is a built-in option for mailing list support which forwards all activity on the forum to a user's email address and which allows for creation of new posts via email.

What do you think of Discourse as an alternative, more manageable way to discuss Spark? There are a few options we can consider:

1. Work with the ASF as well as the Discourse team to allow Discourse to act as an overlay on top of this mailing list https://meta.discourse.org/t/discourse-as-a-front-end-for-existing-asf-mailing-lists/23167?u=nicholaschammas, allowing people to continue to use the mailing list as-is if they want. (This is the toughest but perhaps most attractive option.)
2. Create a new Discourse forum for Spark that is not bound to this user list. This is relatively easy but will effectively fork the community on this list. (We cannot shut down this mailing list in favor of one managed by Discourse.)
3. Don't use Discourse. Just encourage people on this list to post instead on Stack Overflow whenever possible.
4. Something else.

What does everyone think?

Nick
Re: saveAsNewAPIHadoopDataset against hbase hanging in pyspark 1.2.0
this is it (jstack of particular yarn container) - http://pastebin.com/eAdiUYKK thanks, Antony. On Wednesday, 24 December 2014, 16:34, Ted Yu yuzhih...@gmail.com wrote: bq. even when testing with the example from the stock hbase_outputformat.py Can you take jstack of the above and pastebin it ? Thanks On Wed, Dec 24, 2014 at 4:49 AM, Antony Mayi antonym...@yahoo.com.invalid wrote: Hi, have been using this without any issues with spark 1.1.0 but after upgrading to 1.2.0 saving a RDD from pyspark using saveAsNewAPIHadoopDataset into HBase just hangs - even when testing with the example from the stock hbase_outputformat.py. anyone having same issue? (and able to solve?) using hbase 0.98.6 and yarn-client mode. thanks,Antony.
Re: saveAsNewAPIHadoopDataset against hbase hanging in pyspark 1.2.0
I went over the jstack but didn't find any call related to hbase or zookeeper. Do you find anything important in the logs ? Looks like container launcher was waiting for the script to return some result: 1. at org.apache.hadoop.util.Shell$ShellCommandExecutor.parseExecResult(Shell.java:715) 2. at org.apache.hadoop.util.Shell.runCommand(Shell.java:524) On Wed, Dec 24, 2014 at 3:11 PM, Antony Mayi antonym...@yahoo.com wrote: this is it (jstack of particular yarn container) - http://pastebin.com/eAdiUYKK thanks, Antony. On Wednesday, 24 December 2014, 16:34, Ted Yu yuzhih...@gmail.com wrote: bq. even when testing with the example from the stock hbase_outputformat.py Can you take jstack of the above and pastebin it ? Thanks On Wed, Dec 24, 2014 at 4:49 AM, Antony Mayi antonym...@yahoo.com.invalid wrote: Hi, have been using this without any issues with spark 1.1.0 but after upgrading to 1.2.0 saving a RDD from pyspark using saveAsNewAPIHadoopDataset into HBase just hangs - even when testing with the example from the stock hbase_outputformat.py. anyone having same issue? (and able to solve?) using hbase 0.98.6 and yarn-client mode. thanks, Antony.
RE: Not Serializable exception when integrating SQL and Spark Streaming
Thanks. I debugged this further and below is the cause:

Caused by: java.io.NotSerializableException: org.apache.spark.sql.api.java.JavaSQLContext
- field (class com.basic.spark.NumberCount$2, name: val$sqlContext, type: class org.apache.spark.sql.api.java.JavaSQLContext)
- object (class com.basic.spark.NumberCount$2, com.basic.spark.NumberCount$2@69ddbcc7)
- field (class com.basic.spark.NumberCount$2$1, name: this$0, type: class com.basic.spark.NumberCount$2)
- object (class com.basic.spark.NumberCount$2$1, com.basic.spark.NumberCount$2$1@2524beed)
- field (class org.apache.spark.api.java.JavaPairRDD$$anonfun$toScalaFunction$1, name: fun$1, type: interface org.apache.spark.api.java.function.Function)

I also tried this: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-access-objects-declared-and-initialized-outside-the-call-method-of-JavaRDD-td17094.html#a17150

Why is there a difference: SQLContext is Serializable but JavaSQLContext is not? Is Spark designed like this?

Thanks
Re: saveAsNewAPIHadoopDataset against hbase hanging in pyspark 1.2.0
I just run it by hand from the pyspark shell. Here are the steps:

pyspark --jars /usr/lib/spark/lib/spark-examples-1.2.0-cdh5.3.0-hadoop2.5.0-cdh5.3.0.jar

conf = {"hbase.zookeeper.quorum": "localhost",
...     "hbase.mapred.outputtable": "test",
...     "mapreduce.outputformat.class": "org.apache.hadoop.hbase.mapreduce.TableOutputFormat",
...     "mapreduce.job.output.key.class": "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
...     "mapreduce.job.output.value.class": "org.apache.hadoop.io.Writable"}
keyConv = "org.apache.spark.examples.pythonconverters.StringToImmutableBytesWritableConverter"
valueConv = "org.apache.spark.examples.pythonconverters.StringListToPutConverter"
sc.parallelize([['testkey', 'f1', 'testqual', 'testval']], 1).map(lambda x: (x[0], x)).saveAsNewAPIHadoopDataset(
...     conf=conf,
...     keyConverter=keyConv,
...     valueConverter=valueConv)

Then it prints a few INFO-level messages about submitting a task etc., but then it just hangs. The very same code runs OK on Spark 1.1.0 - the records get stored in HBase.

thanks,
Antony.
Re: saveAsNewAPIHadoopDataset against hbase hanging in pyspark 1.2.0
bq. hbase.zookeeper.quorum: localhost

You are running the hbase cluster in standalone mode? Is the hbase-client jar in the classpath?

Cheers
Re: SchemaRDD to RDD[String]
Hi, On Wed, Dec 24, 2014 at 3:18 PM, Hafiz Mujadid hafizmujadi...@gmail.com wrote: I want to convert a schemaRDD into RDD of String. How can we do that? Currently I am doing like this which is not converting correctly no exception but resultant strings are empty here is my code Hehe, this is the most Java-ish Scala code I have ever seen ;-) Having said that, are you sure that your rows are not empty? The code looks correct to me, actually. Tobias
Re: SchemaRDD to RDD[String]
You might also try the following, which I think is equivalent:

schemaRDD.map(_.mkString(","))

On Wed, Dec 24, 2014 at 8:12 PM, Tobias Pfeiffer t...@preferred.jp wrote: Hi, On Wed, Dec 24, 2014 at 3:18 PM, Hafiz Mujadid hafizmujadi...@gmail.com wrote: I want to convert a schemaRDD into RDD of String. How can we do that? Currently I am doing like this which is not converting correctly, no exception, but the resultant strings are empty; here is my code Hehe, this is the most Java-ish Scala code I have ever seen ;-) Having said that, are you sure that your rows are not empty? The code looks correct to me, actually. Tobias
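A small self-contained Scala sketch of that suggestion against the Spark 1.2-era API (the case class and sample data are made up for illustration):

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SQLContext

case class Person(name: String, age: Int)

val sqlContext = new SQLContext(sc)   // sc is an existing SparkContext
import sqlContext.createSchemaRDD     // implicit conversion RDD[Person] -> SchemaRDD

val people = sc.parallelize(Seq(Person("alice", 30), Person("bob", 25)))
val schemaRDD = sqlContext.createSchemaRDD(people)

// Row behaves like a Seq of column values in this API version, so mkString
// joins every column of the row into one comma-separated String.
val strings: RDD[String] = schemaRDD.map(_.mkString(","))
strings.collect().foreach(println)    // prints "alice,30" and "bob,25"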
Re: Escape commas in file names
No, there is not. Can you open a JIRA? On Tue, Dec 23, 2014 at 6:33 PM, Daniel Siegmann daniel.siegm...@velos.io wrote: I am trying to load a Parquet file which has a comma in its name. Yes, this is a valid file name in HDFS. However, sqlContext.parquetFile interprets this as a comma-separated list of parquet files. Is there any way to escape the comma so it is treated as part of a single file name? -- Daniel Siegmann, Software Developer Velos Accelerating Machine Learning 54 W 40th St, New York, NY 10018 E: daniel.siegm...@velos.io W: www.velos.io
Re: Not Serializable exception when integrating SQL and Spark Streaming
The various spark contexts generally aren't serializable because you can't use them on the executors anyway. We made SQLContext serializable just because it gets pulled into scope more often due to the implicit conversions it contains. You should try marking the variable that holds the context with the annotation @transient.
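For reference, a minimal Scala sketch of the @transient pattern described above (the class name, the case class Person, and the switch from the Java API in the original post to the Scala API are all assumptions made for illustration):

import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext
import org.apache.spark.streaming.dstream.DStream

case class Person(name: String, age: Int)

class StreamingJob(@transient sc: SparkContext) extends Serializable {
  // @transient keeps the context out of any serialized closure; it is only
  // ever used on the driver inside foreachRDD, never on the executors.
  @transient lazy val sqlContext = new SQLContext(sc)

  def process(stream: DStream[Person]): Unit = {
    stream.foreachRDD { rdd =>
      import sqlContext.createSchemaRDD   // implicit RDD[Person] -> SchemaRDD
      rdd.registerTempTable("people")
      val names = sqlContext.sql("SELECT name FROM people")
      // The closure below captures nothing from the enclosing class, so only a
      // tiny serializable function is shipped to the executors.
      names.map(_.getString(0)).collect().foreach(println)
    }
  }
}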
Re: hiveContext.jsonFile fails with Unexpected close marker
Each JSON object needs to be on a single line, since this is the boundary the TextFileInputFormat uses when splitting up large files.

On Wed, Dec 24, 2014 at 12:34 PM, elliott cordo elliottco...@gmail.com wrote:

I have generally been impressed with the way jsonFile eats just about any json data model... but I am getting this error when I try to ingest this file:

Unexpected close marker ']': expected '}'

Here are the commands from the pyspark shell:

from pyspark.sql import HiveContext
hiveContext = HiveContext(sc)
f = hiveContext.jsonFile("sample.json")

Here is some sample json:

{"wf_session": [
{"id":"6021fb91-c9ec-4019-9ab9-f628aee8d259","machine_id":"b45c8c4a-7e8e-442d-8d49-fb7c32e2d813","session_id":"d65ca338-c6b8-4bff-93b1-7f2364726fb7","event_at":"2014-12-19T15:55:31.373Z","screen":"x","type":1,"time_left_secs":1},
{"id":"7e696c19-3ba4-4469-be28-5ef1f0c03d63","machine_id":"b45c8c4a-7e8e-442d-8d49-fb7c32e2d813","session_id":"d65ca338-c6b8-4bff-93b1-7f2364726fb7","event_at":"2014-12-19T15:55:32.385Z","screen":"x","type":2,"ad_unit_id":null,"spot_started_at":"2014-12-19T15:55:12.364Z","spot_ended_at":"2014-12-19T15:55:32.385Z","spot_duration_secs":20,"impression_count":0,"impressions":[],"engagement_count":0,"engagements":[]},
{"id":"68a43006-09bc-4c18-af55-1ebdc0e041a3","machine_id":"b45c8c4a-7e8e-442d-8d49-fb7c32e2d813","session_id":"d65ca338-c6b8-4bff-93b1-7f2364726fb7","event_at":"2014-12-19T15:55:32.375Z","screen":"x","type":3,"duration_secs":20,"to_ad_unit_id":"developmentbea1f3a4-be08-4119-b9f4-7"}
] }

Any help would be appreciated! :) Merry Xmas!
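To illustrate the point above: the same document loads if the entire top-level object is collapsed onto one physical line (field values abbreviated here with "..."), since jsonFile then sees exactly one complete JSON record per line:

{"wf_session": [{"id": "6021fb91-...", "type": 1, ...}, {"id": "7e696c19-...", "type": 2, ...}, {"id": "68a43006-...", "type": 3, ...}]}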
Re: saveAsNewAPIHadoopDataset against hbase hanging in pyspark 1.2.0
I am running it in yarn-client mode, and I believe hbase-client is part of the spark-examples-1.2.0-cdh5.3.0-hadoop2.5.0-cdh5.3.0.jar which I am submitting at launch.

Adding another jstack taken during the hang - http://pastebin.com/QDQrBw70 - this one is of the CoarseGrainedExecutorBackend process and it is referencing hbase and zookeeper.

thanks,
Antony.
Re: saveAsNewAPIHadoopDataset against hbase hanging in pyspark 1.2.0
also hbase itself works ok:

hbase(main):006:0> scan 'test'
ROW        COLUMN+CELL
 key1      column=f1:asd, timestamp=1419463092904, value=456
1 row(s) in 0.0250 seconds
hbase(main):007:0> put 'test', 'testkey', 'f1:testqual', 'testval'
0 row(s) in 0.0170 seconds
hbase(main):008:0> scan 'test'
ROW        COLUMN+CELL
 key1      column=f1:asd, timestamp=1419463092904, value=456
 testkey   column=f1:testqual, timestamp=1419487275905, value=testval
2 row(s) in 0.0270 seconds
Question on saveAsTextFile with overwrite option
Hi, We have a requirement to save RDD output to HDFS with a saveAsTextFile-like API, but we need to overwrite the data if it already exists. I'm not sure whether current Spark supports this kind of operation, or whether I need to check for it manually. There's a thread in the mailing list that discussed this (http://apache-spark-user-list.1001560.n3.nabble.com/How-can-I-make-Spark-1-0-saveAsTextFile-to-overwrite-existing-file-td6696.html), but I'm not sure whether this feature is enabled or not, or which configurations it needs. I'd appreciate your suggestions. Thanks a lot, Jerry
Re: Question on saveAsTextFile with overwrite option
Is it sufficient to set spark.hadoop.validateOutputSpecs to false? http://spark.apache.org/docs/latest/configuration.html - Patrick On Wed, Dec 24, 2014 at 10:52 PM, Shao, Saisai saisai.s...@intel.com wrote: Hi, We have such requirements to save RDD output to HDFS with saveAsTextFile like API, but need to overwrite the data if existed. I'm not sure if current Spark support such kind of operations, or I need to check this manually? There's a thread in mailing list discussed about this (http://apache-spark-user-list.1001560.n3.nabble.com/How-can-I-make-Spark-1-0-saveAsTextFile-to-overwrite-existing-file-td6696.html), I'm not sure this feature is enabled or not, or with some configurations? Appreciate your suggestions. Thanks a lot Jerry - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
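A minimal Scala sketch of that workaround (the app name and output path are illustrative only):

import org.apache.spark.{SparkConf, SparkContext}

// Disables Hadoop's "output directory already exists" check so saveAsTextFile
// will write into an existing path. Note the caveat discussed below: old part
// files are not removed first, so overwriting with fewer partitions can leave
// a mix of old and new data behind.
val conf = new SparkConf()
  .setAppName("overwrite-example")
  .set("spark.hadoop.validateOutputSpecs", "false")
val sc = new SparkContext(conf)

sc.parallelize(1 to 10).saveAsTextFile("hdfs:///tmp/existing-output-dir")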
RE: Question on saveAsTextFile with overwrite option
I am wondering if we can provide more friendly API, other than configuration for this purpose. What do you think Patrick? Cheng Hao -Original Message- From: Patrick Wendell [mailto:pwend...@gmail.com] Sent: Thursday, December 25, 2014 3:22 PM To: Shao, Saisai Cc: user@spark.apache.org; d...@spark.apache.org Subject: Re: Question on saveAsTextFile with overwrite option Is it sufficient to set spark.hadoop.validateOutputSpecs to false? http://spark.apache.org/docs/latest/configuration.html - Patrick On Wed, Dec 24, 2014 at 10:52 PM, Shao, Saisai saisai.s...@intel.com wrote: Hi, We have such requirements to save RDD output to HDFS with saveAsTextFile like API, but need to overwrite the data if existed. I'm not sure if current Spark support such kind of operations, or I need to check this manually? There's a thread in mailing list discussed about this (http://apache-spark-user-list.1001560.n3.nabble.com/How-can-I-make-Sp ark-1-0-saveAsTextFile-to-overwrite-existing-file-td6696.html), I'm not sure this feature is enabled or not, or with some configurations? Appreciate your suggestions. Thanks a lot Jerry - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Question on saveAsTextFile with overwrite option
So the behavior of overwriting existing directories IMO is something we don't want to encourage. The reason why the Hadoop client has these checks is that it's very easy for users to do unsafe things without them. For instance, a user could overwrite an RDD that had 100 partitions with an RDD that has 10 partitions... and if they read back the RDD they would get a corrupted RDD that has a combination of data from the old and new RDD. If users want to circumvent these safety checks, we need to make them explicitly disable them. Given this, I think a config option is as reasonable as any alternatives. This is already pretty easy IMO. - Patrick On Wed, Dec 24, 2014 at 11:28 PM, Cheng, Hao hao.ch...@intel.com wrote: I am wondering if we can provide more friendly API, other than configuration for this purpose. What do you think Patrick? Cheng Hao -Original Message- From: Patrick Wendell [mailto:pwend...@gmail.com] Sent: Thursday, December 25, 2014 3:22 PM To: Shao, Saisai Cc: user@spark.apache.org; d...@spark.apache.org Subject: Re: Question on saveAsTextFile with overwrite option Is it sufficient to set spark.hadoop.validateOutputSpecs to false? http://spark.apache.org/docs/latest/configuration.html - Patrick On Wed, Dec 24, 2014 at 10:52 PM, Shao, Saisai saisai.s...@intel.com wrote: Hi, We have such requirements to save RDD output to HDFS with saveAsTextFile like API, but need to overwrite the data if existed. I'm not sure if current Spark support such kind of operations, or I need to check this manually? There's a thread in mailing list discussed about this (http://apache-spark-user-list.1001560.n3.nabble.com/How-can-I-make-Sp ark-1-0-saveAsTextFile-to-overwrite-existing-file-td6696.html), I'm not sure this feature is enabled or not, or with some configurations? Appreciate your suggestions. Thanks a lot Jerry - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
RE: Question on saveAsTextFile with overwrite option
Thanks Patrick for your detailed explanation. BR Jerry -Original Message- From: Patrick Wendell [mailto:pwend...@gmail.com] Sent: Thursday, December 25, 2014 3:43 PM To: Cheng, Hao Cc: Shao, Saisai; user@spark.apache.org; d...@spark.apache.org Subject: Re: Question on saveAsTextFile with overwrite option So the behavior of overwriting existing directories IMO is something we don't want to encourage. The reason why the Hadoop client has these checks is that it's very easy for users to do unsafe things without them. For instance, a user could overwrite an RDD that had 100 partitions with an RDD that has 10 partitions... and if they read back the RDD they would get a corrupted RDD that has a combination of data from the old and new RDD. If users want to circumvent these safety checks, we need to make them explicitly disable them. Given this, I think a config option is as reasonable as any alternatives. This is already pretty easy IMO. - Patrick On Wed, Dec 24, 2014 at 11:28 PM, Cheng, Hao hao.ch...@intel.com wrote: I am wondering if we can provide more friendly API, other than configuration for this purpose. What do you think Patrick? Cheng Hao -Original Message- From: Patrick Wendell [mailto:pwend...@gmail.com] Sent: Thursday, December 25, 2014 3:22 PM To: Shao, Saisai Cc: user@spark.apache.org; d...@spark.apache.org Subject: Re: Question on saveAsTextFile with overwrite option Is it sufficient to set spark.hadoop.validateOutputSpecs to false? http://spark.apache.org/docs/latest/configuration.html - Patrick On Wed, Dec 24, 2014 at 10:52 PM, Shao, Saisai saisai.s...@intel.com wrote: Hi, We have such requirements to save RDD output to HDFS with saveAsTextFile like API, but need to overwrite the data if existed. I'm not sure if current Spark support such kind of operations, or I need to check this manually? There's a thread in mailing list discussed about this (http://apache-spark-user-list.1001560.n3.nabble.com/How-can-I-make-S p ark-1-0-saveAsTextFile-to-overwrite-existing-file-td6696.html), I'm not sure this feature is enabled or not, or with some configurations? Appreciate your suggestions. Thanks a lot Jerry - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: How to build Spark against the latest
What options should I use when running the make-distribution.sh script? I tried

./make-distribution.sh --hadoop.version 2.6.0 --with-yarn -with-hive --with-tachyon --tgz

but nothing came out.

Regards

-- Original --
From: guxiaobo1982 guxiaobo1...@qq.com
Send time: Wednesday, Dec 24, 2014 6:52 PM
To: Ted Yu yuzhih...@gmail.com
Cc: user@spark.apache.org
Subject: Re: How to build Spark against the latest

Hi Ted, The referenced command works, but where can I get the deployable binaries? Xiaobo Gu

-- Original --
From: Ted Yu yuzhih...@gmail.com
Send time: Wednesday, Dec 24, 2014 12:09 PM
To: guxiaobo1...@qq.com
Cc: user@spark.apache.org
Subject: Re: How to build Spark against the latest

See http://search-hadoop.com/m/JW1q5Cew0j

On Tue, Dec 23, 2014 at 8:00 PM, guxiaobo1982 guxiaobo1...@qq.com wrote: Hi, The official pom.xml file only has a profile for hadoop version 2.4 as the latest version, but I installed hadoop version 2.6.0 with ambari. How can I build Spark against it, just using mvn -Dhadoop.version=2.6.0, or how do I make a corresponding profile for it? Regards, Xiaobo
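For what it's worth, a hedged sketch of the kind of invocation the 1.x build documentation describes for this case (double-check the profile names against your Spark source tree; at the time there was no dedicated hadoop-2.6 profile, so the hadoop-2.4 profile is reused with the version overridden, and the older --hadoop.version/--with-yarn style flags appear to have been replaced by Maven profile arguments passed straight through to mvn):

./make-distribution.sh --name hadoop2.6 --tgz -Phadoop-2.4 -Dhadoop.version=2.6.0 -Pyarn -Phive -Phive-thriftserver

If the build succeeds, the deployable tarball is written to the top of the source tree (spark-<version>-bin-hadoop2.6.tgz in this example).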