Re: Imported CSV file content isn't identical to the original file

2016-02-07 Thread SLiZn Liu
Plus, I’m using *Spark 1.5.2*, with *spark-csv 1.3.0*. Also tried HiveContext, but the result is exactly the same. On Sun, Feb 7, 2016 at 3:44 PM SLiZn Liu wrote: > Hi Spark Users Group, > > I have a csv file to analyse with Spark, but I’m having trouble importing >

Re: Imported CSV file content isn't identical to the original file

2016-02-07 Thread Igor Berman
show has a truncate argument; pass false so it won't truncate your results. On 7 February 2016 at 11:01, SLiZn Liu wrote: > Plus, I’m using *Spark 1.5.2*, with *spark-csv 1.3.0*. Also tried > HiveContext, but the result is exactly the same. > > On Sun, Feb 7, 2016 at
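
A minimal spark-shell sketch of this suggestion, assuming a DataFrame named df has already been loaded from the CSV file (the name is illustrative, not code from the thread):

    // Spark 1.5.x: show() truncates strings longer than 20 characters by default
    // and right-aligns the cells; passing false prints the full cell contents.
    df.show()           // truncated output
    df.show(false)      // full strings
    df.show(20, false)  // full strings, with an explicit row count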

Re: Imported CSV file content isn't identical to the original file

2016-02-07 Thread SLiZn Liu
Hi Igor, In my case, it’s not a matter of *truncate*. As the Spark API doc for show() reads, truncate: Whether truncate long strings. If true, strings more than 20 characters will be truncated and all cells will be aligned right… whereas the leading characters of my two columns are

Re: Shuffle memory woes

2016-02-07 Thread Igor Berman
So, can you provide code snippets? It is especially interesting to see what your transformation chain is and how many partitions there are on each side of the shuffle operation. The question is why it can't fit stuff in memory when you are shuffling - maybe your partitioner on the "reduce" side is not

Re: Imported CSV file content isn't identical to the original file

2016-02-07 Thread SLiZn Liu
*Update*: in local mode (spark-shell --master local[2], whether reading from the local file system or HDFS), it works well. But that doesn’t solve this issue, since my data scale requires hundreds of CPU cores and hundreds of GB of RAM. BTW, it’s Chinese traditional New Year now, wish you all a happy year and

Re: Shuffle memory woes

2016-02-07 Thread Corey Nolet
As for the second part of your question: we have a fairly complex join process which requires a ton of stage orchestration from our driver. I've written some code to be able to walk down our DAG tree and execute siblings in the tree concurrently where possible (forcing cache to disk on children
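
A rough sketch of the general pattern described here (persisting sibling branches to disk and materializing them as concurrent jobs from the driver); the RDDs, data and DISK_ONLY level are illustrative, not the actual code:

    import org.apache.spark.storage.StorageLevel
    import scala.concurrent.{Await, Future}
    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.concurrent.duration.Duration

    // Two hypothetical sibling branches that do not depend on each other.
    val left  = sc.parallelize(Seq(1 -> "a", 2 -> "b")).persist(StorageLevel.DISK_ONLY)
    val right = sc.parallelize(Seq(1 -> "x", 2 -> "y")).persist(StorageLevel.DISK_ONLY)

    // Materialize both branches as separate jobs; the scheduler can run them concurrently.
    val jobs = Seq(Future(left.count()), Future(right.count()))
    Await.result(Future.sequence(jobs), Duration.Inf)

    // The parent stage then consumes the already-cached children.
    val joined = left.join(right)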

Re: Shuffle memory woes

2016-02-07 Thread Corey Nolet
Igor, I don't think the question is "why can't it fit stuff in memory". I know why it can't fit stuff in memory- because it's a large dataset that needs to have a reduceByKey() run on it. My understanding is that when it doesn't fit into memory it needs to spill in order to consolidate
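
For reference, a hedged sketch of the kind of shuffle under discussion; the input path, key extraction and partition count are made up, and raising the reduce-side partition count only spreads the shuffled data over more tasks rather than eliminating spills:

    // Hypothetical input: extract a key per line and count occurrences.
    val pairs = sc.textFile("hdfs:///data/events")   // illustrative path
      .map(line => (line.split(",")(0), 1L))

    // The second argument sets the number of reduce-side partitions; each task
    // then has to hold (and potentially spill) a smaller slice of the shuffle data.
    val counts = pairs.reduceByKey(_ + _, 2000)
    counts.saveAsTextFile("hdfs:///data/event-counts")   // illustrative path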

Re: Apache Spark data locality when integrating with Kafka

2016-02-07 Thread Yuval.Itzchakov
I would definitely try to avoid hosting Kafka and Spark on the same servers. Kafka and Spark will be doing a lot of I/O between them, so you'll want to maximize those resources and not share them on the same server. You'll want each Kafka broker to be on a dedicated server, as well as your

Unexpected element type class

2016-02-07 Thread Anoop Shiralige
Hi All, I have written some functions in Scala which I want to expose in PySpark (interactively, Spark 1.6.0). The Scala function (loadAvro) returns a JavaRDD[AvroGenericRecord]. AvroGenericRecord is my wrapper class over org.apache.avro.generic.GenericRecord. I am trying to convert

Re: Apache Spark data locality when integrating with Kafka

2016-02-07 Thread Diwakar Dhanuskodi
Fanoos, where do you want the solution to be deployed? On premise or cloud? Regards, Diwakar. Sent from Samsung Mobile. Original message From: "Yuval.Itzchakov" Date: 07/02/2016 19:38 (GMT+05:30) To: user@spark.apache.org Cc: Subject: Re: Apache

Re: Shuffle memory woes

2016-02-07 Thread Sea
Hi Corey, "The dataset is 100gb at most, the spills can up to 10T-100T" - are your input files lzo format, and do you use sc.text()? If memory is not enough, spark will spill 3-4x of input data to disk. ------ Original message ------ From: "Corey

Re: Apache Spark data locality when integrating with Kafka

2016-02-07 Thread أنس الليثي
Diwakar, we have our own servers. We will not use any cloud service like Amazon's. On 7 February 2016 at 18:24, Diwakar Dhanuskodi < diwakar.dhanusk...@gmail.com> wrote: > Fanoos, > Where do you want the solution to be deployed? On premise or cloud? > > Regards > Diwakar . > > > > Sent from

Advice on using spark shell for Hive table sql queries

2016-02-07 Thread Mich Talebzadeh
Hi, Pretty new to spark shell, so I decided to write this piece of code to get data from Hive tables in spark shell. The issue is that I don't really need to define the sqlContext here, as I can do a simple command like sql("select count(1) from t") WITHOUT sqlContext. sql("select
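
A minimal illustration of why the explicit sqlContext is optional: spark-shell pre-creates sqlContext (a HiveContext when Spark is built with Hive support) and imports its sql method, so both forms below run the same query ("t" is just a placeholder table name):

    // Inside spark-shell; the two statements are equivalent.
    sqlContext.sql("select count(1) from t").show()
    sql("select count(1) from t").show()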

Re: spark metrics question

2016-02-07 Thread Matt K
Thanks Takeshi, that's exactly what I was looking for. On Fri, Feb 5, 2016 at 12:32 PM, Takeshi Yamamuro wrote: > How about using `spark.jars` to send jars into a cluster? > > On Sat, Feb 6, 2016 at 12:00 AM, Matt K wrote: > >> Yes. And what I'm
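
A small sketch of the spark.jars approach, set programmatically; the jar path and app name are purely illustrative. Equivalently, the jar can be passed on the command line with spark-submit --jars.

    import org.apache.spark.{SparkConf, SparkContext}

    // spark.jars takes a comma-separated list of jars to ship to the executors.
    val conf = new SparkConf()
      .setAppName("metrics-example")                          // illustrative name
      .set("spark.jars", "/path/to/custom-metrics-sink.jar")  // hypothetical jar

    val sc = new SparkContext(conf)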

Re: Bad Digest error while doing aws s3 put

2016-02-07 Thread Steve Loughran
> On 7 Feb 2016, at 07:57, Dhimant wrote: > >at > com.amazon.ws.emr.hadoop.fs.s3n.MultipartUploadOutputStream.uploadSinglePart(MultipartUploadOutputStream.java:245) >... 15 more > Caused by: com.amazonaws.services.s3.model.AmazonS3Exception: The >

Re: Apache Spark data locality when integrating with Kafka

2016-02-07 Thread Diwakar Dhanuskodi
We are using Spark in two ways: 1. YARN with Spark support, Kafka running along with data nodes. 2. Spark master and workers running with some of the Kafka brokers. Data locality is important. Regards, Diwakar. Sent from Samsung Mobile. Original message From: أنس

Re: Shuffle memory woes

2016-02-07 Thread Corey Nolet
Charles, Thank you for chiming in and I'm glad someone else is experiencing this too and not just me. I know very well how the Spark shuffle works and I've done deep dive presentations @ Spark meetups in the past. This problem is something that goes beyond that and, I believe, it exposes a

Re: Spark Streaming with Druid?

2016-02-07 Thread Hemant Bhanawat
You may want to have a look at the spark-druid project already in progress: https://github.com/SparklineData/spark-druid-olap You can also have a look at SnappyData, which is a low latency store tightly integrated with Spark, Spark SQL and Spark

Re: Shuffle memory woes

2016-02-07 Thread Charles Chao
"The dataset is 100gb at most, the spills can up to 10T-100T" -- I have had the same experiences, although not to this extreme (the spills were < 10T while the input was ~ 100s gb) and haven't found any solution yet. I don't believe this is related to input data format. in my case, I got my

Re: Imported CSV file content isn't identical to the original file

2016-02-07 Thread Luciano Resende
I tried 1.5.0, 1.6.0 and 2.0.0 trunk with com.databricks:spark-csv_2.10:1.3.0, with expected results, where the columns seem to be read properly. +--+--+ |C0|C1| +--+--+ |1446566430 | 2015-11-04 00:00:30|
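
For completeness, a spark-shell sketch of the kind of read described above, assuming the shell was started with the spark-csv package on the classpath; the path and options are illustrative:

    // e.g. spark-shell --packages com.databricks:spark-csv_2.10:1.3.0
    val df = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "false")
      .option("inferSchema", "true")
      .load("hdfs:///data/sample.csv")   // hypothetical path

    df.show(false)   // full cell contents, to compare against the original file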

Re: Handling Hive Table With large number of rows

2016-02-07 Thread Jörn Franke
Can you provide more details? Your use case does not sound like you need Spark. Your version is anyway too old; it does not make sense to develop now with 1.2.1. There is no "project limitation" that can justify this. > On 08 Feb 2016, at 06:48, Meetu Maltiar wrote:

Re: Handling Hive Table With large number of rows

2016-02-07 Thread Meetu Maltiar
Thanks Jörn. We have to construct an XML on an HDFS location from a couple of Hive tables, which join on one key. The data in both tables we have to join is large. I was wondering about the right approach. XML creation will also be tricky, as we cannot hold the objects in memory. Old Spark 1.2.1 is a bummer,

Handling Hive Table With large number of rows

2016-02-07 Thread Meetu Maltiar
Hi, I am working on an application that reads a single Hive table, does some manipulations on each row of it, and finally constructs an XML. The Hive table will be a large data set, with no chance of fitting it in memory. I intend to use SparkSQL 1.2.1 (due to project limitations). Any pointers to me on
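
A rough sketch of one possible shape for this in Spark SQL 1.2.x, mapping each row of a Hive table to an XML fragment and writing the fragments back to HDFS without collecting anything to the driver; the table, columns, paths and XML layout are all made up, and wrapping the fragments in a single root element would still need a separate step:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    val sc = new SparkContext(new SparkConf().setAppName("hive-to-xml"))  // illustrative
    val hiveContext = new HiveContext(sc)

    // Hypothetical table and columns.
    val rows = hiveContext.sql("SELECT id, name, amount FROM some_hive_table")

    // One XML fragment per row; the full table never has to fit in memory.
    val xmlFragments = rows.map { row =>
      s"<record><id>${row(0)}</id><name>${row(1)}</name><amount>${row(2)}</amount></record>"
    }

    xmlFragments.saveAsTextFile("hdfs:///output/records-xml")             // illustrative path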