Re: collect on hadoopFile RDD returns wrong results
I posted an example in my previous post. Tested on Spark 1.0.2, 1.2.0-SNAPSHOT and 1.1.0 for Hadoop 2.4.0, on Windows and on Linux servers with Hortonworks Hadoop 2.4, in local[4] mode. Any ideas about this Spark behavior?

Akhil Das-2 wrote
> Can you dump out a small piece of data? while doing rdd.collect and
> rdd.foreach(println)
>
> Thanks
> Best Regards
>
> On Wed, Sep 17, 2014 at 12:26 PM, vasiliy <zadonskiyd@...> wrote:
>> it also appears in streaming hdfs fileStream
Re: collect on hadoopFile RDD returns wrong results
full code example:

def main(args: Array[String]) {
  val conf = new SparkConf().setAppName("ErrorExample").setMaster("local[8]")
    .set("spark.serializer", classOf[KryoSerializer].getName)
  val sc = new SparkContext(conf)

  val rdd = sc.hadoopFile(
    "hdfs://./user.avro",
    classOf[org.apache.avro.mapred.AvroInputFormat[User]],
    classOf[org.apache.avro.mapred.AvroWrapper[User]],
    classOf[org.apache.hadoop.io.NullWritable],
    1)

  val usersRDD = rdd.map({ case (u, _) => u.datum() })
  usersRDD.foreach(println)

  println("-")

  val collected = usersRDD.collect()
  collected.foreach(println)
}

output (without info logging etc.):

{"id": "1", "name": "a"}
{"id": "2", "name": "b"}
{"id": "3", "name": "c"}
{"id": "4", "name": "d"}
{"id": "5", "name": "e"}
{"id": "6", "name": "f"}
-
{"id": "6", "name": "f"}
{"id": "6", "name": "f"}
{"id": "6", "name": "f"}
{"id": "6", "name": "f"}
{"id": "6", "name": "f"}
{"id": "6", "name": "f"}
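For what it's worth, this output pattern is consistent with Hadoop RecordReaders reusing the same AvroWrapper object for every record, so collect() ends up holding many references to the last datum, while foreach prints each record before it is overwritten. A minimal sketch of the usual workaround, assuming User is an Avro specific record generated with a getClassSchema() method (as Avro's specific compiler normally produces):

import org.apache.avro.specific.SpecificData

// Copy each datum into a fresh object before collecting, because the
// wrapper returned by the InputFormat is reused for every record.
val copiedUsers = rdd.map { case (wrapper, _) =>
  SpecificData.get().deepCopy(User.getClassSchema, wrapper.datum())
}
copiedUsers.collect().foreach(println)  // each element is now an independent copy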
Re: collect on hadoopFile RDD returns wrong results
it also appears in streaming hdfs fileStream
collect on hadoopFile RDD returns wrong results
Hello. I have a hadoopFile RDD and I tried to collect its items to the driver program, but it returns an array of identical records (all equal to the last record of my file). My code is like this:

val rdd = sc.hadoopFile(
  "hdfs:///data.avro",
  classOf[org.apache.avro.mapred.AvroInputFormat[MyAvroRecord]],
  classOf[org.apache.avro.mapred.AvroWrapper[MyAvroRecord]],
  classOf[org.apache.hadoop.io.NullWritable],
  10)

val collectedData = rdd.collect()
for (s <- collectedData) {
  println(s)
}

It prints wrong data, but rdd.foreach(println) works as expected. What is wrong with my code, and how can I collect the hadoop RDD records (actually I want to collect only parts of it) to the driver program?
Re: Spark SQL Thrift JDBC server deployment for production
it works, thanks
Re: can fileStream() or textFileStream() remember state?
When you get a stream from sc.fileStream(), Spark will process only files with a file timestamp greater than the current timestamp, so existing data in HDFS should not be processed again. You may have another problem, though: Spark will not process files moved into your HDFS folder between application restarts. To avoid this you should use checkpoints, as described here:

https://spark.apache.org/docs/latest/streaming-programming-guide.html#failure-of-the-driver-node

akso wrote
> When streaming from HDFS through either sc.fileStream() or
> sc.textFileStream(), how can state info be saved so that it won't process
> duplicate data?
> When the app is stopped and restarted, all data from HDFS is processed again.
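A minimal sketch of the checkpoint-based recovery pattern from that guide, assuming placeholder checkpoint and input paths (adjust both to your cluster):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Placeholder paths, not taken from this thread.
val checkpointDir = "hdfs:///user/spark/checkpoints/my-app"
val inputDir = "hdfs:///user/spark/input"

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("FileStreamWithCheckpoint")
  val ssc = new StreamingContext(conf, Seconds(30))
  ssc.checkpoint(checkpointDir)              // metadata checkpointing for driver recovery
  val lines = ssc.textFileStream(inputDir)   // picks up files newer than the current batch time
  lines.foreachRDD(rdd => rdd.foreach(println))
  ssc
}

// On restart, rebuild the context from the checkpoint instead of creating a new one,
// so batches scheduled before the shutdown/failure are recovered rather than skipped.
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()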
Spark SQL Thrift JDBC server deployment for production
Hi, I have a question about the Spark SQL Thrift JDBC server. Is there a best practice for Spark SQL deployment? If I understand right, the script ./sbin/start-thriftserver.sh starts the Thrift JDBC server in local mode. Is there a script option for running this server in yarn-cluster mode?
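Not an authoritative answer, but the Spark SQL docs state that start-thriftserver.sh accepts the same command-line options as bin/spark-submit (plus --hiveconf), so something along these lines should at least submit it to YARN in client mode. The resource values and hiveconf settings below are placeholder assumptions, and whether a true yarn-cluster deployment of the Thrift server is supported is worth checking for your version:

./sbin/start-thriftserver.sh \
  --master yarn-client \
  --num-executors 4 \
  --executor-memory 4g \
  --hiveconf hive.server2.thrift.port=10000 \
  --hiveconf hive.server2.thrift.bind.host=0.0.0.0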