Re: Confusing behavior of newAPIHadoopFile

2014-07-28 Thread chang cheng
The value in the (key, value) pairs returned by textFile is exactly one line of the
input.

But what I want is the fields between the two "!!" markers; hope this makes sense.



-
Senior in Tsinghua Univ.
github: http://www.github.com/uronce-cc


Re: Confusing behavior of newAPIHadoopFile

2014-07-28 Thread chang cheng
Nope.

My input file's format is:
!!
string1
string2
!!
string3
string4

sc.textFile(path) will return RDD("!!", "string1", "string2", "!!",
"string3", "string4").

What we need now is to transform this RDD into RDD("string1", "string2",
"string3", "string4").

Your solution may not handle this.



-
Senior in Tsinghua Univ.
github: http://www.github.com/uronce-cc


Re: Confusing behavior of newAPIHadoopFile

2014-07-28 Thread Sean Owen
Oh, you literally mean these are different lines, not the structure of a line.

You can't solve this in general by reading the entire file into one
string. If the input is tens of gigabytes you will probably exhaust
memory on any of your machines. (Or, you might as well not bother with
Spark then.)

Do you really mean you just want the strings that aren't "!!"? That's just
a filter operation. But as I understand it, you need an RDD of complex
data structures, containing many fields and key-value pairs across
many lines.
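
(If a plain filter really were enough, a minimal sketch, reusing the
sc.textFile(path) RDD from above:

  // Drop the "!!" delimiter lines, keeping only the data lines.
  val dataLines = sc.textFile(path).filter(_.trim != "!!")

That gives RDD("string1", "string2", "string3", "string4"), but it loses
the grouping between the delimiters.)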

This is a difficult format to work with: Hadoop assumes a line is a
record, which is the common case, but your records span multiple lines.

If you have many small files, you could use wholeTextFiles to read
entire small text files as a string value, and simply parse it with a
Scala function as normal. That's fine as long as none of the files are
huge.
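
A minimal sketch of that approach (assuming the same "!!" delimiter; the
input directory path is a placeholder):

  // Each element of wholeTextFiles is (filePath, entireFileContents).
  val files = sc.wholeTextFiles("hdfs:///data/small-files")

  // Split each file's contents on the "!!" delimiter; each resulting
  // chunk holds the lines of one record, ready for your own parsing.
  val records = files.flatMap { case (_, content) =>
    content.split("!!").map(_.trim).filter(_.nonEmpty)
  }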

You can try mapPartitions for larger files, where you parse an
Iterator[String] instead of one String at a time and combine results
from across lines into an Iterator[YourRecordType]. This works as long
as Hadoop does not split a file across several partitions, but not quite
when a partition boundary falls inside one of your records. If you're
willing to tolerate missing some records here and there, it is a fine,
scalable way to do it.
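
A rough sketch (with the caveat above: a record straddling a partition
boundary would come out truncated):

  // Group the lines between "!!" markers into one List[String] per record.
  val records = sc.textFile(path).mapPartitions { lines =>
    val buffered = lines.buffered
    new Iterator[List[String]] {
      def hasNext: Boolean = {
        // Skip any "!!" delimiter lines before checking for more content.
        while (buffered.hasNext && buffered.head.trim == "!!") buffered.next()
        buffered.hasNext
      }
      def next(): List[String] = {
        if (!hasNext) throw new NoSuchElementException("no more records")
        val buf = scala.collection.mutable.ListBuffer.empty[String]
        // Collect lines until the next delimiter or the end of the partition.
        while (buffered.hasNext && buffered.head.trim != "!!") buf += buffered.next()
        buf.toList
      }
    }
  }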



Re: Confusing behavior of newAPIHadoopFile

2014-07-28 Thread chang cheng
Exactly: the fields between the "!!" markers form a customized (key, value)
data structure.

So newAPIHadoopFile may be the best practice for now. For this specific format,
changing the record delimiter from the default "\n" to "!!\n" is the cheapest
option. That can only be done directly in Hadoop 2.x; in Hadoop 1.x it requires
implementing an InputFormat, although most of the code is the same as
TextInputFormat apart from the delimiter.
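
Something like this, for the Hadoop 2.x case (the
textinputformat.record.delimiter key is read by the new-API line reader;
path as before):

  import org.apache.hadoop.conf.Configuration
  import org.apache.hadoop.io.{LongWritable, Text}
  import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

  val conf = new Configuration(sc.hadoopConfiguration)
  // Treat "!!\n" rather than "\n" as the record separator.
  conf.set("textinputformat.record.delimiter", "!!\n")

  val records = sc
    .newAPIHadoopFile(path, classOf[TextInputFormat],
      classOf[LongWritable], classOf[Text], conf)
    .map { case (_, text) => text.toString.trim } // copy out of the reused Text object
    .filter(_.nonEmpty)

Each element of records is then the block of lines for one record
(e.g. "string1\nstring2").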

This is my first time posting on this mailing list, and I find you guys are
really nice! Thanks for the discussion!



-
Senior in Tsinghua Univ.
github: http://www.github.com/uronce-cc