Re: collect on hadoopFile RDD returns wrong results
I posted an example in a previous post. Tested on Spark 1.0.2, 1.2.0-SNAPSHOT, and 1.1.0 for Hadoop 2.4.0, on Windows and on Linux servers with Hortonworks Hadoop 2.4, in local[4] mode. Any ideas about this Spark behavior?

Akhil Das-2 wrote:
Can you dump out a small piece of data while doing rdd.collect and rdd.foreach(println)?
Thanks, Best Regards

On Wed, Sep 17, 2014 at 12:26 PM, vasiliy wrote:
it also appears in streaming hdfs fileStream
Re: collect on hadoopFile RDD returns wrong results
It also appears in streaming HDFS fileStream.
Re: collect on hadoopFile RDD returns wrong results
Full code example:

def main(args: Array[String]) {
  val conf = new SparkConf().setAppName("ErrorExample").setMaster("local[8]")
    .set("spark.serializer", classOf[KryoSerializer].getName)
  val sc = new SparkContext(conf)

  val rdd = sc.hadoopFile(
    "hdfs://./user.avro",
    classOf[org.apache.avro.mapred.AvroInputFormat[User]],
    classOf[org.apache.avro.mapred.AvroWrapper[User]],
    classOf[org.apache.hadoop.io.NullWritable],
    1)

  val usersRDD = rdd.map { case (u, _) => u.datum() }
  usersRDD.foreach(println)
  println("-")
  val collected = usersRDD.collect()
  collected.foreach(println)
}

Output (without info logging etc.):

{"id": 1, "name": "a"}
{"id": 2, "name": "b"}
{"id": 3, "name": "c"}
{"id": 4, "name": "d"}
{"id": 5, "name": "e"}
{"id": 6, "name": "f"}
-
{"id": 6, "name": "f"}
{"id": 6, "name": "f"}
{"id": 6, "name": "f"}
{"id": 6, "name": "f"}
{"id": 6, "name": "f"}
{"id": 6, "name": "f"}
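[Editor's note] This output pattern matches the well-known Hadoop RecordReader object-reuse behavior: AvroInputFormat hands back the same AvroWrapper (and reuses the underlying datum) for every record, so collect() ships back an array of references that all point at the last record read, while foreach(println) prints each record before the reader overwrites it. The SparkContext.hadoopFile scaladoc carries a note to this effect and recommends copying records before caching or aggregating them. A minimal sketch of that workaround, deep-copying each datum out of the reused buffer before collecting (assuming User is an Avro-generated specific record with the usual getSchema accessor):

import org.apache.avro.specific.SpecificData

// Copy each record out of the buffer the RecordReader reuses, so the
// collected array holds distinct objects rather than six references
// to the last record read.
val usersRDD = rdd.map { case (w, _) =>
  SpecificData.get().deepCopy(w.datum().getSchema, w.datum())
}

usersRDD.collect().foreach(println)  // should now print all six distinct users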
Re: Spark SQL Thrift JDBC server deployment for production
It works, thanks.
collect on hadoopFile RDD returns wrong results
Hello. I have a hadoopFile RDD and I tried to collect its items to the driver program, but it returns an array of identical records (each equal to the last record of my file). My code is like this:

val rdd = sc.hadoopFile(
  "hdfs:///data.avro",
  classOf[org.apache.avro.mapred.AvroInputFormat[MyAvroRecord]],
  classOf[org.apache.avro.mapred.AvroWrapper[MyAvroRecord]],
  classOf[org.apache.hadoop.io.NullWritable],
  10)

val collectedData = rdd.collect()
for (s <- collectedData) {
  println(s)
}

It prints wrong data, but rdd.foreach(println) works as expected. What is wrong with my code, and how can I collect the hadoopFile RDD records (actually I want to collect only parts of it) to the driver program?
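[Editor's note] Since the goal is to bring only part of the RDD back to the driver, rdd.take(n) avoids materializing everything; note, though, that it hits the same record-reuse behavior discussed above, so records should be copied in a map first. A minimal sketch, where copyRecord is a hypothetical helper and MyAvroRecord is assumed to be an Avro specific record:

// Hypothetical helper: copies a record out of the reader's reused buffer.
def copyRecord(r: MyAvroRecord): MyAvroRecord =
  org.apache.avro.specific.SpecificData.get().deepCopy(r.getSchema, r)

// Copy first, then bring only the first 100 records to the driver.
val firstHundred = rdd.map { case (w, _) => copyRecord(w.datum()) }.take(100)
firstHundred.foreach(println)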
Re: can fileStream() or textFileStream() remember state?
When you get a stream from ssc.fileStream(), Spark will process only files whose timestamp is newer than the current batch time, so data already in HDFS should not be processed again. You may have another problem: Spark will not process files that were moved into your HDFS folder between restarts of your application. To avoid this, you should use checkpoints, as described here: https://spark.apache.org/docs/latest/streaming-programming-guide.html#failure-of-the-driver-node

akso wrote:
When streaming from HDFS through either ssc.fileStream() or ssc.textFileStream(), how can state info be saved so that it won't process duplicate data? When the app is stopped and restarted, all data from HDFS is processed again.
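[Editor's note] A minimal sketch of the checkpoint-based recovery pattern from that guide (the checkpoint directory, input path, app name, and batch interval are placeholder values):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "hdfs:///checkpoints/my-app"  // hypothetical path

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("FileStreamApp")
  val ssc = new StreamingContext(conf, Seconds(10))
  ssc.checkpoint(checkpointDir)  // metadata, including file-tracking state, goes here

  // Set up the stream and its processing inside the factory, so it is
  // restored from the checkpoint instead of being rebuilt on restart.
  val lines = ssc.textFileStream("hdfs:///input")
  lines.count().print()
  ssc
}

// On a clean start this calls createContext(); after a crash or restart
// it rebuilds the context (and its file-tracking state) from the checkpoint.
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()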