Re: collect on hadoopFile RDD returns wrong results
I posted an example in a previous post. Tested on Spark 1.0.2, 1.2.0-SNAPSHOT, and 1.1.0 for Hadoop 2.4.0, on Windows and on Linux servers with Hortonworks Hadoop 2.4, in local[4] mode. Any ideas about this Spark behavior?

Akhil Das-2 wrote:
Can you dump out a small piece of data while doing rdd.collect and rdd.foreach(println)?
Thanks, Best Regards

On Wed, Sep 17, 2014 at 12:26 PM, vasiliy wrote:
it also appears in streaming hdfs fileStream
Re: collect on hadoopFile RDD returns wrong results
It also appears in streaming HDFS fileStream.
Re: collect on hadoopFile RDD returns wrong results
Full code example:

def main(args: Array[String]) {
  val conf = new SparkConf().setAppName("ErrorExample").setMaster("local[8]")
    .set("spark.serializer", classOf[KryoSerializer].getName)
  val sc = new SparkContext(conf)

  val rdd = sc.hadoopFile(
    "hdfs://./user.avro",
    classOf[org.apache.avro.mapred.AvroInputFormat[User]],
    classOf[org.apache.avro.mapred.AvroWrapper[User]],
    classOf[org.apache.hadoop.io.NullWritable],
    1)

  val usersRDD = rdd.map { case (u, _) => u.datum() }
  usersRDD.foreach(println)
  println("-")
  val collected = usersRDD.collect()
  collected.foreach(println)
}

Output (without info logging etc.):

{"id": 1, "name": "a"}
{"id": 2, "name": "b"}
{"id": 3, "name": "c"}
{"id": 4, "name": "d"}
{"id": 5, "name": "e"}
{"id": 6, "name": "f"}
-
{"id": 6, "name": "f"}
{"id": 6, "name": "f"}
{"id": 6, "name": "f"}
{"id": 6, "name": "f"}
{"id": 6, "name": "f"}
{"id": 6, "name": "f"}
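[Editor's note] This output pattern matches the well-known Hadoop RecordReader object-reuse behavior: AvroInputFormat hands back the same AvroWrapper (and reuses the underlying datum) for every record, so collect() ships back an array of references that all point at the last record read, while foreach(println) prints each record before the reader overwrites it. The SparkContext.hadoopFile scaladoc carries a note to this effect and recommends copying records before caching or aggregating them. A minimal sketch of that workaround, deep-copying each datum out of the reused buffer before collecting (assuming User is an Avro-generated specific record with the usual getSchema accessor):

import org.apache.avro.specific.SpecificData

// Copy each record out of the buffer the RecordReader reuses, so the
// collected array holds distinct objects rather than six references
// to the last record read.
val usersRDD = rdd.map { case (w, _) =>
  SpecificData.get().deepCopy(w.datum().getSchema, w.datum())
}

usersRDD.collect().foreach(println)  // should now print all six distinct users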
Re: Spark SQL Thrift JDBC server deployment for production
It works, thanks.
collect on hadoopFile RDD returns wrong results
Hello. I have a hadoopFile RDD and I tried to collect its items to the driver program, but it returns an array of identical records (each equal to the last record of my file). My code is like this:

val rdd = sc.hadoopFile(
  "hdfs:///data.avro",
  classOf[org.apache.avro.mapred.AvroInputFormat[MyAvroRecord]],
  classOf[org.apache.avro.mapred.AvroWrapper[MyAvroRecord]],
  classOf[org.apache.hadoop.io.NullWritable],
  10)

val collectedData = rdd.collect()
for (s <- collectedData) {
  println(s)
}

It prints wrong data, but rdd.foreach(println) works as expected. What is wrong with my code, and how can I collect the hadoopFile RDD records (actually I want to collect only parts of it) to the driver program?
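[Editor's note] Since the goal is to bring only part of the RDD back to the driver, rdd.take(n) avoids materializing everything; note, though, that it hits the same record-reuse behavior discussed above, so records should be copied in a map first. A minimal sketch, where copyRecord is a hypothetical helper and MyAvroRecord is assumed to be an Avro specific record:

// Hypothetical helper: copies a record out of the reader's reused buffer.
def copyRecord(r: MyAvroRecord): MyAvroRecord =
  org.apache.avro.specific.SpecificData.get().deepCopy(r.getSchema, r)

// Copy first, then bring only the first 100 records to the driver.
val firstHundred = rdd.map { case (w, _) => copyRecord(w.datum()) }.take(100)
firstHundred.foreach(println)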
Re: can fileStream() or textFileStream() remember state?
When you get a stream from ssc.fileStream(), Spark will process only files whose timestamp is newer than the current batch time, so data already in HDFS should not be processed again. You may have another problem: Spark will not process files that were moved into your HDFS folder between restarts of your application. To avoid this, you should use checkpoints, as described here: https://spark.apache.org/docs/latest/streaming-programming-guide.html#failure-of-the-driver-node

akso wrote:
When streaming from HDFS through either ssc.fileStream() or ssc.textFileStream(), how can state info be saved so that it won't process duplicate data? When the app is stopped and restarted, all data from HDFS is processed again.
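[Editor's note] A minimal sketch of the checkpoint-based recovery pattern from that guide (the checkpoint directory, input path, app name, and batch interval are placeholder values):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "hdfs:///checkpoints/my-app"  // hypothetical path

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("FileStreamApp")
  val ssc = new StreamingContext(conf, Seconds(10))
  ssc.checkpoint(checkpointDir)  // metadata, including file-tracking state, goes here

  // Set up the stream and its processing inside the factory, so it is
  // restored from the checkpoint instead of being rebuilt on restart.
  val lines = ssc.textFileStream("hdfs:///input")
  lines.count().print()
  ssc
}

// On a clean start this calls createContext(); after a crash or restart
// it rebuilds the context (and its file-tracking state) from the checkpoint.
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()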