Re: collect on hadoopFile RDD returns wrong results

2014-09-18 Thread vasiliy
I posted an example in a previous post. Tested on Spark 1.0.2, 1.2.0-SNAPSHOT and 1.1.0 for Hadoop 2.4.0, on Windows and Linux servers with Hortonworks Hadoop 2.4, in local[4] mode. Any ideas about this Spark behavior? Akhil Das-2 wrote: Can you dump out a small piece of data? while doing

Re: collect on hadoopFile RDD returns wrong results

2014-09-18 Thread Reynold Xin
This happens because HadoopRDD (and the underlying Hadoop InputFormat) reuses the same record objects to avoid allocation. It is tricky to fix in Spark itself. However, in most cases you can clone each record before collecting, so that you are not collecting the same object over and over again.
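
To illustrate the reuse problem without needing a cluster, here is a minimal sketch in plain Scala. It is a simulation only: MutableRecord and reusingReader are hypothetical stand-ins for a Hadoop record and a record reader that hands back one shared instance, not Spark or Hadoop APIs. It shows why buffering references (as collect does) yields repeated copies of the last record, why consuming records one at a time (as foreach(println) does) looks correct, and how copying each record first avoids the problem.

```scala
// Simulation only: no Spark or Hadoop dependency.
// MutableRecord is a hypothetical stand-in for a mutable Hadoop/Avro record.
final class MutableRecord(var value: String) {
  // Hypothetical copy method; a real record type needs its own deep copy.
  def copyRecord(): MutableRecord = new MutableRecord(value)
}

object ReuseDemo {
  // Mimics a reader that mutates and re-emits a single shared instance.
  def reusingReader(inputs: Seq[String]): Iterator[MutableRecord] = {
    val shared = new MutableRecord("")
    inputs.iterator.map { s =>
      shared.value = s
      shared
    }
  }

  val inputs: Seq[String] = Seq("a", "b", "c")

  // Like collect: buffer the references, then look at them afterwards.
  // Every buffered element is the same object, holding the last value.
  def collectedWithoutCopy: Seq[String] =
    reusingReader(inputs).toVector.map(_.value)

  // Like foreach(println): read each record before the next one overwrites it.
  def readWhileStreaming: Seq[String] =
    reusingReader(inputs).map(_.value).toVector

  // The workaround: copy each record before buffering it.
  def collectedWithCopy: Seq[String] =
    reusingReader(inputs).map(_.copyRecord()).toVector.map(_.value)

  def main(args: Array[String]): Unit = {
    println(collectedWithoutCopy) // Vector(c, c, c)
    println(readWhileStreaming)   // Vector(a, b, c)
    println(collectedWithCopy)    // Vector(a, b, c)
  }
}
```

In a real job the copy would go in a map step before collect (or before anything else that buffers records), using whatever deep-copy mechanism the record type provides.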

Re: collect on hadoopFile RDD returns wrong results

2014-09-17 Thread vasiliy
It also appears in streaming HDFS fileStream.
-- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/collect-on-hadoopFile-RDD-returns-wrong-results-tp14368p14425.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: collect on hadoopFile RDD returns wrong results

2014-09-17 Thread Akhil Das
Can you dump out a small piece of data while doing rdd.collect and rdd.foreach(println)? Thanks, Best Regards. On Wed, Sep 17, 2014 at 12:26 PM, vasiliy zadonsk...@gmail.com wrote: it also appears in streaming hdfs fileStream

Re: collect on hadoopFile RDD returns wrong results

2014-09-17 Thread vasiliy
Full code example:
    def main(args: Array[String]) {
      val conf = new SparkConf()
        .setAppName("ErrorExample")
        .setMaster("local[8]")
        .set("spark.serializer", classOf[KryoSerializer].getName)
      val sc = new SparkContext(conf)
      val rdd = sc.hadoopFile(
        "hdfs://./user.avro",