Re: collect on hadoopFile RDD returns wrong results

2014-09-18 Thread vasiliy
I posted an example in a previous post. Tested on Spark 1.0.2, 1.2.0-SNAPSHOT
and 1.1.0 for Hadoop 2.4.0, on Windows and on Linux servers with Hortonworks
Hadoop 2.4, in local[4] mode. Any ideas about this Spark behavior?


Akhil Das-2 wrote
 Can you dump out a small piece of data while doing rdd.collect and
 rdd.foreach(println)?
 
 Thanks
 Best Regards
 
 On Wed, Sep 17, 2014 at 12:26 PM, vasiliy <zadonskiyd@...> wrote:
 
 It also appears in streaming HDFS fileStream.






Re: collect on hadoopFile RDD returns wrong results

2014-09-17 Thread vasiliy
It also appears in streaming HDFS fileStream.
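For concreteness, a hypothetical sketch of that streaming variant (not from the original post; it assumes Avro's new-API AvroKeyInputFormat, since fileStream requires a mapreduce InputFormat, and it reuses the User class from the batch example below):

import org.apache.avro.mapred.AvroKey
import org.apache.avro.mapreduce.AvroKeyInputFormat
import org.apache.hadoop.io.NullWritable
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(10))

// fileStream takes a new-API (mapreduce) InputFormat, so AvroKeyInputFormat
// stands in for the old-API AvroInputFormat used in the batch job.
val users = ssc
  .fileStream[AvroKey[User], NullWritable, AvroKeyInputFormat[User]]("hdfs:///incoming")
  .map { case (k, _) => k.datum() } // same reused-object hazard as with hadoopFile

users.print()
ssc.start()
ssc.awaitTermination()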






Re: collect on hadoopFile RDD returns wrong results

2014-09-17 Thread vasiliy
Full code example:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.serializer.KryoSerializer

def main(args: Array[String]) {
  val conf = new SparkConf()
    .setAppName("ErrorExample")
    .setMaster("local[8]")
    .set("spark.serializer", classOf[KryoSerializer].getName)
  val sc = new SparkContext(conf)

  val rdd = sc.hadoopFile(
    "hdfs://./user.avro",
    classOf[org.apache.avro.mapred.AvroInputFormat[User]],
    classOf[org.apache.avro.mapred.AvroWrapper[User]],
    classOf[org.apache.hadoop.io.NullWritable],
    1)

  val usersRDD = rdd.map { case (u, _) => u.datum() }
  usersRDD.foreach(println)

  println("-")

  val collected = usersRDD.collect()
  collected.foreach(println)
}


Output (without INFO logging etc.):
{"id": 1, "name": "a"}
{"id": 2, "name": "b"}
{"id": 3, "name": "c"}
{"id": 4, "name": "d"}
{"id": 5, "name": "e"}
{"id": 6, "name": "f"}
-
{"id": 6, "name": "f"}
{"id": 6, "name": "f"}
{"id": 6, "name": "f"}
{"id": 6, "name": "f"}
{"id": 6, "name": "f"}
{"id": 6, "name": "f"}
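A likely cause, for context (it is never stated in this thread): Hadoop RecordReaders reuse a single Writable/AvroWrapper instance per split, and the Spark API docs for hadoopFile warn that such records must be copied in a map before being cached or collected. foreach prints each successive state of the shared object, while collect ships six references to its final state. A minimal workaround sketch under that assumption, copying each datum with Avro's deepCopy before collecting:

import org.apache.avro.specific.SpecificData

// Copy each record out of the reused AvroWrapper before it leaves the
// iterator; deepCopy allocates a fresh object per record.
val copiedRDD = rdd.map { case (u, _) =>
  val d = u.datum()
  SpecificData.get().deepCopy(d.getSchema, d)
}

// With the copy in place, collect() should match the foreach output.
copiedRDD.collect().foreach(println)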







Re: Spark SQL Thrift JDBC server deployment for production

2014-09-16 Thread vasiliy
It works, thanks.






collect on hadoopFile RDD returns wrong results

2014-09-16 Thread vasiliy
Hello. I have a hadoopFile RDD and I tried to collect its items to the driver
program, but it returns an array of identical records (all equal to the last
record of my file). My code looks like this:

val rdd = sc.hadoopFile(
  "hdfs:///data.avro",
  classOf[org.apache.avro.mapred.AvroInputFormat[MyAvroRecord]],
  classOf[org.apache.avro.mapred.AvroWrapper[MyAvroRecord]],
  classOf[org.apache.hadoop.io.NullWritable],
  10)

val collectedData = rdd.collect()

for (s <- collectedData) {
  println(s)
}

It prints wrong data, but rdd.foreach(println) works as expected.

What is wrong with my code, and how can I collect the hadoop RDD's records
(actually I want to collect only parts of it) to the driver program?







Re: can fileStream() or textFileStream() remember state?

2014-09-11 Thread vasiliy
When you get a stream from sc.fileStream(), Spark will process only files with
a file timestamp greater than the current timestamp, so data already in HDFS
should not be processed again. You may have another problem: Spark will not
process files that were moved into your HDFS folder between restarts of your
application. To avoid this you should use checkpoints, as described here:
https://spark.apache.org/docs/latest/streaming-programming-guide.html#failure-of-the-driver-node
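For example, a minimal recovery sketch along the lines of that guide (the paths, batch interval, and stream body are illustrative, not from the thread):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Hypothetical checkpoint location for illustration.
val checkpointDir = "hdfs:///checkpoints/my-app"

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("FileStreamWithCheckpoint")
  val ssc = new StreamingContext(conf, Seconds(10))
  ssc.checkpoint(checkpointDir)                  // enable metadata checkpointing
  ssc.textFileStream("hdfs:///incoming").print() // set up the stream once, here
  ssc
}

// On restart this rebuilds the context from the checkpoint instead of calling
// createContext() again, so already-seen files are not reprocessed.
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()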


akso wrote
 When streaming from HDFS through either sc.fileStream() or
 sc.textFileStream(), how can state info be saved so that it won't process
 duplicate data?
 When the app is stopped and restarted, all data from HDFS is processed again.




