Re: Imported CSV file content isn't identical to the original file

2016-02-07 Thread SLiZn Liu
Plus, I’m using *Spark 1.5.2*, with *spark-csv 1.3.0*. Also tried HiveContext, but the result is exactly the same. On Sun, Feb 7, 2016 at 3:44 PM SLiZn Liu wrote: > Hi Spark Users Group, > > I have a csv file to analyse with Spark, but I’m having trouble importing >

Re: Imported CSV file content isn't identical to the original file

2016-02-07 Thread Igor Berman
show has a truncate argument; pass false so it won't truncate your results. On 7 February 2016 at 11:01, SLiZn Liu wrote: > Plus, I’m using *Spark 1.5.2*, with *spark-csv 1.3.0*. Also tried > HiveContext, but the result is exactly the same. > > On Sun, Feb 7, 2016 at
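
A minimal spark-shell sketch of this suggestion, assuming a DataFrame named df has already been loaded from the CSV file (the name is illustrative, not code from the thread):

    // Spark 1.5.x: show() truncates strings longer than 20 characters by default
    // and right-aligns the cells; passing false prints the full cell contents.
    df.show()           // truncated output
    df.show(false)      // full strings
    df.show(20, false)  // full strings, with an explicit row count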

Re: Imported CSV file content isn't identical to the original file

2016-02-07 Thread SLiZn Liu
Hi Igor, In my case, it’s not a matter of *truncate*. As the Spark API doc for show() reads, truncate: Whether truncate long strings. If true, strings more than 20 characters will be truncated and all cells will be aligned right… whereas the leading characters of my two columns are

Re: Shuffle memory woes

2016-02-07 Thread Igor Berman
So, can you provide code snippets? It is especially interesting to see what your transformation chain is and how many partitions there are on each side of the shuffle operation. The question is why it can't fit stuff in memory when you are shuffling - maybe your partitioner on the "reduce" side is not

Re: Imported CSV file content isn't identical to the original file

2016-02-07 Thread SLiZn Liu
*Update*: in local mode (spark-shell --master local[2], whether reading from the local file system or HDFS), it works well. But that doesn’t solve this issue, since my data scale requires hundreds of CPU cores and hundreds of GB of RAM. BTW, it’s Chinese traditional New Year now, wish you all a happy year and

Re: Shuffle memory woes

2016-02-07 Thread Corey Nolet
As for the second part of your question: we have a fairly complex join process which requires a ton of stage orchestration from our driver. I've written some code to be able to walk down our DAG tree and execute siblings in the tree concurrently where possible (forcing cache to disk on children
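
A rough sketch of the general pattern described here (persisting sibling branches to disk and materializing them as concurrent jobs from the driver); the RDDs, data and DISK_ONLY level are illustrative, not the actual code:

    import org.apache.spark.storage.StorageLevel
    import scala.concurrent.{Await, Future}
    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.concurrent.duration.Duration

    // Two hypothetical sibling branches that do not depend on each other.
    val left  = sc.parallelize(Seq(1 -> "a", 2 -> "b")).persist(StorageLevel.DISK_ONLY)
    val right = sc.parallelize(Seq(1 -> "x", 2 -> "y")).persist(StorageLevel.DISK_ONLY)

    // Materialize both branches as separate jobs; the scheduler can run them concurrently.
    val jobs = Seq(Future(left.count()), Future(right.count()))
    Await.result(Future.sequence(jobs), Duration.Inf)

    // The parent stage then consumes the already-cached children.
    val joined = left.join(right)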

Re: Shuffle memory woes

2016-02-07 Thread Corey Nolet
Igor, I don't think the question is "why can't it fit stuff in memory". I know why it can't fit stuff in memory- because it's a large dataset that needs to have a reduceByKey() run on it. My understanding is that when it doesn't fit into memory it needs to spill in order to consolidate
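
For reference, a hedged sketch of the kind of shuffle under discussion; the input path, key extraction and partition count are made up, and raising the reduce-side partition count only spreads the shuffled data over more tasks rather than eliminating spills:

    // Hypothetical input: extract a key per line and count occurrences.
    val pairs = sc.textFile("hdfs:///data/events")   // illustrative path
      .map(line => (line.split(",")(0), 1L))

    // The second argument sets the number of reduce-side partitions; each task
    // then has to hold (and potentially spill) a smaller slice of the shuffle data.
    val counts = pairs.reduceByKey(_ + _, 2000)
    counts.saveAsTextFile("hdfs:///data/event-counts")   // illustrative path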

Re: Apache Spark data locality when integrating with Kafka

2016-02-07 Thread Yuval.Itzchakov
I would definitely try to avoid hosting Kafka and Spark on the same servers. Kafka and Spark will be doing a lot of I/O between them, so you'll want to maximize those resources and not share them on the same server. You'll want each Kafka broker to be on a dedicated server, as well as your

Unexpected element type class

2016-02-07 Thread Anoop Shiralige
Hi All, I have written some functions in Scala which I want to expose in PySpark (interactively, Spark 1.6.0). The Scala function (loadAvro) returns a JavaRDD[AvroGenericRecord]. AvroGenericRecord is my wrapper class over org.apache.avro.generic.GenericRecord. I am trying to convert

Re: Apache Spark data locality when integrating with Kafka

2016-02-07 Thread Diwakar Dhanuskodi
Fanoos, where do you want the solution to be deployed? On premise or cloud? Regards, Diwakar. Sent from Samsung Mobile. Original message From: "Yuval.Itzchakov" Date: 07/02/2016 19:38 (GMT+05:30) To: user@spark.apache.org Cc: Subject: Re: Apache

Re: Shuffle memory woes

2016-02-07 Thread Sea
Hi Corey, "The dataset is 100gb at most, the spills can up to 10T-100T" - are your input files lzo format, and do you use sc.text()? If memory is not enough, spark will spill 3-4x of input data to disk. ------ Original message ------ From: "Corey

Re: Apache Spark data locality when integrating with Kafka

2016-02-07 Thread أنس الليثي
Diwakar, we have our own servers. We will not use any cloud service like Amazon's. On 7 February 2016 at 18:24, Diwakar Dhanuskodi < diwakar.dhanusk...@gmail.com> wrote: > Fanoos, > Where do you want the solution to be deployed? On premise or cloud? > > Regards > Diwakar . > > > > Sent from

Advice on using spark shell for Hive table sql queries

2016-02-07 Thread Mich Talebzadeh
Hi, Pretty new to spark shell, so I decided to write this piece of code to get data from Hive tables in spark shell. The issue is that I don't really need to define the sqlContext here, as I can do a simple command like sql("select count(1) from t") WITHOUT sqlContext. sql("select
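
A minimal illustration of why the explicit sqlContext is optional: spark-shell pre-creates sqlContext (a HiveContext when Spark is built with Hive support) and imports its sql method, so both forms below run the same query ("t" is just a placeholder table name):

    // Inside spark-shell; the two statements are equivalent.
    sqlContext.sql("select count(1) from t").show()
    sql("select count(1) from t").show()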

Re: spark metrics question

2016-02-07 Thread Matt K
Thanks Takeshi, that's exactly what I was looking for. On Fri, Feb 5, 2016 at 12:32 PM, Takeshi Yamamuro wrote: > How about using `spark.jars` to send jars into a cluster? > > On Sat, Feb 6, 2016 at 12:00 AM, Matt K wrote: > >> Yes. And what I'm
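
A small sketch of the spark.jars approach, set programmatically; the jar path and app name are purely illustrative. Equivalently, the jar can be passed on the command line with spark-submit --jars.

    import org.apache.spark.{SparkConf, SparkContext}

    // spark.jars takes a comma-separated list of jars to ship to the executors.
    val conf = new SparkConf()
      .setAppName("metrics-example")                          // illustrative name
      .set("spark.jars", "/path/to/custom-metrics-sink.jar")  // hypothetical jar

    val sc = new SparkContext(conf)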

Re: Bad Digest error while doing aws s3 put

2016-02-07 Thread Steve Loughran
> On 7 Feb 2016, at 07:57, Dhimant wrote: > >at > com.amazon.ws.emr.hadoop.fs.s3n.MultipartUploadOutputStream.uploadSinglePart(MultipartUploadOutputStream.java:245) >... 15 more > Caused by: com.amazonaws.services.s3.model.AmazonS3Exception: The >

Re: Apache Spark data locality when integrating with Kafka

2016-02-07 Thread Diwakar Dhanuskodi
We are using Spark in two ways: 1. YARN with Spark support, Kafka running along with data nodes. 2. Spark master and workers running with some of the Kafka brokers. Data locality is important. Regards, Diwakar. Sent from Samsung Mobile. Original message From: أنس

Re: Shuffle memory woes

2016-02-07 Thread Corey Nolet
Charles, Thank you for chiming in and I'm glad someone else is experiencing this too and not just me. I know very well how the Spark shuffle works and I've done deep dive presentations @ Spark meetups in the past. This problem is something that goes beyond that and, I believe, it exposes a

Re: Spark Streaming with Druid?

2016-02-07 Thread Hemant Bhanawat
You may want to have a look at the spark-druid project already in progress: https://github.com/SparklineData/spark-druid-olap You can also have a look at SnappyData, which is a low latency store tightly integrated with Spark, Spark SQL and Spark

Re: Shuffle memory woes

2016-02-07 Thread Charles Chao
"The dataset is 100gb at most, the spills can up to 10T-100T" -- I have had the same experiences, although not to this extreme (the spills were < 10T while the input was ~ 100s gb) and haven't found any solution yet. I don't believe this is related to input data format. in my case, I got my

Re: Imported CSV file content isn't identical to the original file

2016-02-07 Thread Luciano Resende
I tried 1.5.0, 1.6.0 and 2.0.0 trunk with com.databricks:spark-csv_2.10:1.3.0, with expected results, where the columns seem to be read properly. +--+--+ |C0|C1| +--+--+ |1446566430 | 2015-11-04 00:00:30|
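
For completeness, a spark-shell sketch of the kind of read described above, assuming the shell was started with the spark-csv package on the classpath; the path and options are illustrative:

    // e.g. spark-shell --packages com.databricks:spark-csv_2.10:1.3.0
    val df = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "false")
      .option("inferSchema", "true")
      .load("hdfs:///data/sample.csv")   // hypothetical path

    df.show(false)   // full cell contents, to compare against the original file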

Re: Handling Hive Table With large number of rows

2016-02-07 Thread Jörn Franke
Can you provide more details? Your use case does not sound like you need Spark. Your version is anyway too old; it does not make sense to develop now with 1.2.1. There is no "project limitation" that can justify this. > On 08 Feb 2016, at 06:48, Meetu Maltiar wrote:

Re: Handling Hive Table With large number of rows

2016-02-07 Thread Meetu Maltiar
Thanks Jörn. We have to construct an XML on an HDFS location from a couple of Hive tables, which join on one key. The data in both tables we have to join is large. I was wondering about the right approach. XML creation will also be tricky, as we cannot hold the objects in memory. Old Spark 1.2.1 is a bummer,

Handling Hive Table With large number of rows

2016-02-07 Thread Meetu Maltiar
Hi, I am working on an application that reads a single Hive table, does some manipulations on each row of it, and finally constructs an XML. The Hive table will be a large data set, with no chance of fitting it in memory. I intend to use SparkSQL 1.2.1 (due to project limitations). Any pointers to me on
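
A rough sketch of one possible shape for this in Spark SQL 1.2.x, mapping each row of a Hive table to an XML fragment and writing the fragments back to HDFS without collecting anything to the driver; the table, columns, paths and XML layout are all made up, and wrapping the fragments in a single root element would still need a separate step:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    val sc = new SparkContext(new SparkConf().setAppName("hive-to-xml"))  // illustrative
    val hiveContext = new HiveContext(sc)

    // Hypothetical table and columns.
    val rows = hiveContext.sql("SELECT id, name, amount FROM some_hive_table")

    // One XML fragment per row; the full table never has to fit in memory.
    val xmlFragments = rows.map { row =>
      s"<record><id>${row(0)}</id><name>${row(1)}</name><amount>${row(2)}</amount></record>"
    }

    xmlFragments.saveAsTextFile("hdfs:///output/records-xml")             // illustrative path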