Re: collect on hadoopFile RDD returns wrong results

2014-09-18 Thread vasiliy
I posted an example in my previous post. Tested on Spark 1.0.2, 1.2.0-SNAPSHOT,
and 1.1.0 for Hadoop 2.4.0, on Windows and Linux servers with Hortonworks
Hadoop 2.4, in local[4] mode. Any ideas about this Spark behavior?


Akhil Das-2 wrote
> Can you dump out a small piece of data? while doing rdd.collect and
> rdd.foreach(println)
> 
> Thanks
> Best Regards
> 
> On Wed, Sep 17, 2014 at 12:26 PM, vasiliy <zadonskiyd@...> wrote:
> 
>> it also appears in streaming hdfs fileStream




Re: collect on hadoopFile RDD returns wrong results

2014-09-17 Thread vasiliy
Full code example:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.serializer.KryoSerializer

// User is the Avro-generated record class for the schema in user.avro (not shown)
object ErrorExample {
  def main(args: Array[String]) {
    val conf = new SparkConf()
      .setAppName("ErrorExample")
      .setMaster("local[8]")
      .set("spark.serializer", classOf[KryoSerializer].getName)
    val sc = new SparkContext(conf)

    val rdd = sc.hadoopFile(
      "hdfs://./user.avro",
      classOf[org.apache.avro.mapred.AvroInputFormat[User]],
      classOf[org.apache.avro.mapred.AvroWrapper[User]],
      classOf[org.apache.hadoop.io.NullWritable],
      1)

    val usersRDD = rdd.map { case (u, _) => u.datum() }
    usersRDD.foreach(println)

    println("-")

    val collected = usersRDD.collect()
    collected.foreach(println)
  }
}


Output (without INFO logging, etc.):
{"id": "1", "name": "a"}
{"id": "2", "name": "b"}
{"id": "3", "name": "c"}
{"id": "4", "name": "d"}
{"id": "5", "name": "e"}
{"id": "6", "name": "f"}
-
{"id": "6", "name": "f"}
{"id": "6", "name": "f"}
{"id": "6", "name": "f"}
{"id": "6", "name": "f"}
{"id": "6", "name": "f"}
{"id": "6", "name": "f"}
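
A likely explanation (not confirmed in this thread): Hadoop's RecordReader reuses the same AvroWrapper instance for every record, so by the time collect() materializes the partition, every element references the object holding the last record read, while foreach(println) prints each record before it is overwritten. A minimal sketch of the usual workaround, continuing from the rdd defined above and copying each datum out of the reused wrapper inside the map (the use of Avro's SpecificData.deepCopy here is an assumption for illustration, not something stated in this thread):

import org.apache.avro.specific.SpecificData

// Copy each record out of the reused AvroWrapper so that collect()
// returns independent objects rather than many references to one instance.
val copiedUsers = rdd.map { case (wrapper, _) =>
  val datum = wrapper.datum()
  SpecificData.get().deepCopy(datum.getSchema, datum)
}

copiedUsers.collect().foreach(println)  // should now print all six distinct users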







Re: collect on hadoopFile RDD returns wrong results

2014-09-16 Thread vasiliy
It also appears in streaming HDFS fileStream.






collect on hadoopFile RDD returns wrong results

2014-09-16 Thread vasiliy
Hello. I have a hadoopFile RDD and I tried to collect its items to the driver
program, but it returns an array of identical records (all equal to the last
record of my file). My code is like this:

val rdd = sc.hadoopFile(
  "hdfs:///data.avro",
  classOf[org.apache.avro.mapred.AvroInputFormat[MyAvroRecord]],
  classOf[org.apache.avro.mapred.AvroWrapper[MyAvroRecord]],
  classOf[org.apache.hadoop.io.NullWritable],
  10)


val collectedData = rdd.collect()

for (s <- collectedData) {
  println(s)
}

It prints wrong data, but rdd.foreach(println) works as expected.

What is wrong with my code, and how can I collect the hadoop RDD's records
(actually I only want to collect parts of it) to the driver program?
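
One way to bring only part of the data to the driver, sketched under the assumption that MyAvroRecord has a useful toString (mapping to an immutable String inside the map also sidesteps the Hadoop object reuse noted above; the record count is illustrative):

// Convert each record to a String while it is still valid inside the map,
// then pull only the first 100 records to the driver instead of the whole RDD.
val firstRecords = rdd
  .map { case (wrapper, _) => wrapper.datum().toString }
  .take(100)

firstRecords.foreach(println)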







Re: Spark SQL Thrift JDBC server deployment for production

2014-09-16 Thread vasiliy
It works, thanks.






Re: can fileStream() or textFileStream() remember state?

2014-09-11 Thread vasiliy
When you get a stream from sc.fileStream(), Spark will only process files whose
timestamp is greater than the current timestamp, so data already in HDFS should
not be processed again. You may have another problem: Spark will not process
files that were moved into your HDFS folder between application restarts. To
avoid this you should use checkpoints, as described here:
https://spark.apache.org/docs/latest/streaming-programming-guide.html#failure-of-the-driver-node
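
A minimal sketch of that checkpoint pattern (the paths, batch interval, and the trivial foreachRDD body are illustrative assumptions, not from this thread):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object CheckpointedFileStream {
  // Illustrative locations; adjust for your cluster.
  val checkpointDir = "hdfs:///checkpoints/filestream-app"
  val inputDir = "hdfs:///incoming"

  def createContext(): StreamingContext = {
    val conf = new SparkConf().setAppName("CheckpointedFileStream")
    val ssc = new StreamingContext(conf, Seconds(30))
    ssc.checkpoint(checkpointDir)
    // textFileStream only picks up files newer than the stream's clock;
    // checkpointing lets a restarted driver resume from saved state.
    ssc.textFileStream(inputDir).foreachRDD(rdd => rdd.foreach(println))
    ssc
  }

  def main(args: Array[String]): Unit = {
    // Recover the context from the checkpoint if it exists, otherwise build it fresh.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()
  }
}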


akso wrote
> When streaming from HDFS through either sc.fileStream() or
> sc.textFileStream(), how can state info be saved so that it won't process
> duplicate data?
> When the app is stopped and restarted, all data from HDFS is processed again.








Spark SQL Thrift JDBC server deployment for production

2014-09-10 Thread vasiliy
Hi, I have a question about the Spark SQL Thrift JDBC server.

Is there a best practice for Spark SQL deployment? If I understand right, the
script

./sbin/start-thriftserver.sh

starts the Thrift JDBC server in local mode. Are there script options for
running this server in yarn-cluster mode?
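
For what it's worth, the Spark SQL documentation says start-thriftserver.sh accepts all bin/spark-submit command-line options, so pointing it at YARN can be sketched roughly as below (executor sizing and port are illustrative assumptions). Since the server hosts the JDBC endpoint in its driver process, it is normally launched in yarn-client rather than yarn-cluster mode:

# Options after the script name are forwarded to spark-submit; values are illustrative.
./sbin/start-thriftserver.sh \
  --master yarn-client \
  --num-executors 4 \
  --executor-memory 4g \
  --hiveconf hive.server2.thrift.port=10000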


