Re: [Spark Launcher] How to launch parallel jobs?

2017-02-13 Thread Cosmin Posteuca
Hi Egor, About the first problem, I think you are right; it makes sense. About the second problem, I checked the available resources on port 8088 and it shows 16 available cores. I start my job with 4 executors with 1 core each, and 1 GB per executor. My job uses a maximum of 50 MB of memory (just for testing).

Re: Lost executor 4 Container killed by YARN for exceeding memory limits.

2017-02-13 Thread nancy henry
Hi, how do I set these parameters while launching the Spark shell: spark.shuffle.memoryFraction=0.5 and spark.yarn.executor.memoryOverhead=1024? I tried it like this, but I am getting the error below: spark-shell --master yarn --deploy-mode client --driver-memory 16G --num-executors 500 executor-cores
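For reference, these properties have no dedicated spark-shell flags; they are passed with --conf. A minimal sketch of the invocation (resource numbers are illustrative, not a recommendation):

    spark-shell --master yarn --deploy-mode client \
      --driver-memory 16G \
      --conf spark.shuffle.memoryFraction=0.5 \
      --conf spark.yarn.executor.memoryOverhead=1024

Note that on Spark 1.6+, spark.shuffle.memoryFraction only takes effect when spark.memory.useLegacyMode=true; with the default unified memory manager it is ignored.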

Re: Strange behavior with 'not' and filter pushdown

2017-02-13 Thread Everett Anderson
Went ahead and opened https://issues.apache.org/jira/browse/SPARK-19586 though I'd generally expect to just close it as fixed in 2.1.0 and roll on. On Sat, Feb 11, 2017 at 5:01 PM, Everett Anderson wrote: > On the plus side, looks like this may be fixed in 2.1.0: > > ==

Re: Case class with POJO - encoder issues

2017-02-13 Thread Michael Armbrust
You are right, you need that PR. I pinged the author, but otherwise it would be great if someone could carry it over the finish line. On Sat, Feb 11, 2017 at 4:19 PM, Jason White wrote: > I'd like to create a Dataset using some classes from Geotools to do some >
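While that PR is pending, one general workaround (not from this thread, and with an obvious cost) is a Kryo-based encoder, which stores the POJO as a single binary column. A sketch for spark-shell; GeoPoint is a hypothetical stand-in for the actual Geotools class:

    import org.apache.spark.sql.{Encoder, Encoders}

    // Hypothetical POJO standing in for the Geotools class.
    class GeoPoint(val x: Double, val y: Double) extends Serializable

    // Kryo serializes the whole object as one binary blob, so columnar
    // pruning is lost, but the missing-encoder error goes away.
    implicit val geoEncoder: Encoder[GeoPoint] = Encoders.kryo[GeoPoint]
    val ds = spark.createDataset(Seq(new GeoPoint(1.0, 2.0)))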

Re: Practical configuration to run LSH in Spark 2.1.0

2017-02-13 Thread Nick Pentreath
The original Uber authors provided this performance test result: https://docs.google.com/document/d/19BXg-67U83NVB3M0I84HVBVg3baAVaESD_mrg_-vLro This was for MinHash only, though, so it's not clear what the scalability is for the other metric types. The SignRandomProjectionLSH is not yet in
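For context, a minimal MinHashLSH sketch against the Spark 2.1 ml API, for spark-shell (data and parameter values are illustrative):

    import org.apache.spark.ml.feature.MinHashLSH
    import org.apache.spark.ml.linalg.Vectors

    // Tiny illustrative dataset of sparse binary vectors.
    val df = spark.createDataFrame(Seq(
      (0, Vectors.sparse(6, Seq((0, 1.0), (1, 1.0), (2, 1.0)))),
      (1, Vectors.sparse(6, Seq((2, 1.0), (3, 1.0), (4, 1.0))))
    )).toDF("id", "features")

    val mh = new MinHashLSH()
      .setNumHashTables(5)
      .setInputCol("features")
      .setOutputCol("hashes")

    val model = mh.fit(df)
    // Approximate self-join on Jaccard distance below 0.8.
    model.approxSimilarityJoin(df, df, 0.8).show()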

How to specify default value for StructField?

2017-02-13 Thread vbegar
Hello, I specified a StructType like this: val mySchema = StructType(Array(StructField("f1", StringType, true), StructField("f2", StringType, true))). I have many ORC files stored in the HDFS location /user/hos/orc_files_test_together. These files use different schemas: some of them have only
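StructField itself has no default-value mechanism. A common workaround (a minimal sketch for spark-shell, assuming ORC support is available and the files can be read in groups of like schema; column names follow the post) is to add any missing column after the read and then fill a default:

    import org.apache.spark.sql.functions.lit

    val df = spark.read.orc("/user/hos/orc_files_test_together")
    // Add the column as null if this batch of files lacks it.
    val withF2 =
      if (df.columns.contains("f2")) df
      else df.withColumn("f2", lit(null).cast("string"))
    // Replace nulls in f2 with a chosen default value.
    val result = withF2.na.fill("default", Seq("f2"))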

Re: Spark 2.1.0 issue with spark-shell and pyspark

2017-02-13 Thread jerrytim
I came across the same problem when I ran my code at "model.save(sc, path)". Error info: IllegalArgumentException: u"Error while instantiating 'org.apache.spark.sql.hive.HiveSessionState':" My platform is Mac; I installed Spark with the Hadoop prebuilt package, then integrated PySpark with Jupyter.

Re: [Spark Launcher] How to launch parallel jobs?

2017-02-13 Thread Egor Pahomov
About the second problem: I understand this can happen in two cases: (1) one job prevents the other from getting resources for executors, or (2) the bottleneck is reading from disk, so you cannot really parallelize that. I have no experience with the second case, but it's easy to verify the first one: just look

Re: [Spark Launcher] How to launch parallel jobs?

2017-02-13 Thread Egor Pahomov
"But if i increase only executor-cores the finish time is the same". More experienced ones can correct me, if I'm wrong, but as far as I understand that: one partition processed by one spark task. Task is always running on 1 core and not parallelized among cores. So if you have 5 partitions and

using spark-xml_2.10 to extract data from XML file

2017-02-13 Thread Carlo . Allocca
Dear All, I am using spark-xml_2.10 to parse and extract some data from XML files. I am getting null values even though the XML files actually contain values.
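For reference, a minimal read sketch with the Databricks spark-xml reader ("book" and the path are hypothetical). Nulls typically mean the rowTag or the inferred/declared schema does not match the actual element layout:

    val df = spark.read
      .format("com.databricks.spark.xml")
      .option("rowTag", "book")    // element that delimits one row
      .load("/path/to/files.xml")
    df.printSchema()  // compare this against the real XML structure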

Re: Driver hung and happend out of memory while writing to console progress bar

2017-02-13 Thread Spark User
How much memory have you allocated to the driver? The driver stores some state for tracking the task, stage and job history that you can see in the Spark console, and it does take up a significant portion of the heap, anywhere from 200 MB to 1 GB, depending on your map-reduce steps. Either way, that is a good
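Two generally relevant knobs here (standard Spark settings, not from this thread): the console progress bar can be disabled outright, and the amount of job/stage history the driver retains can be trimmed. A sketch; values are illustrative and must be set before the context starts:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.ui.showConsoleProgress", "false") // no progress bar
      .set("spark.ui.retainedJobs", "100")          // default is 1000
      .set("spark.ui.retainedStages", "100")        // default is 1000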

Re: Question about best Spark tuning

2017-02-13 Thread Spark User
My take on the 2-3 tasks per CPU core guideline is that you want to ensure you are utilizing the cores to the max, which helps with scaling and performance. The question would be: why not 1 task per core? The reason is that you can probably get a good handle on the average execution time per
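A sketch of sizing default parallelism to that guideline (cluster numbers are illustrative, for 10 executors with 4 cores each):

    import org.apache.spark.SparkConf

    val totalCores = 10 * 4
    // 3 tasks per core, per the guideline discussed above.
    val conf = new SparkConf()
      .set("spark.default.parallelism", (totalCores * 3).toString)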

Re: Is it better to Use Java or Python on Scala for Spark for using big data sets

2017-02-13 Thread Spark User
Spark has more support for Scala; by that I mean more APIs are available for Scala compared to Python or Java. Also, Scala code will be more concise and easier to read; Java is very verbose. On Thu, Feb 9, 2017 at 10:21 PM, Irving Duran wrote: > I would say Java, since it

Re: Lost executor 4 Container killed by YARN for exceeding memory limits.

2017-02-13 Thread Thakrar, Jayesh
Nancy, as your log output indicated, your executor exceeded its 11 GB memory limit. While you might want to address the root cause/data volume as suggested by Jon, you can do an immediate test by changing your command as follows: spark-shell --master yarn --deploy-mode client --driver-memory 16G

Re: Parquet Gzipped Files

2017-02-13 Thread Jörn Franke
Your vendor should use Parquet's internal compression and not take a Parquet file and gzip it. > On 13 Feb 2017, at 18:48, Benjamin Kim wrote: > > We are receiving files from an outside vendor who creates a Parquet data file > and Gzips it before delivery. Does anyone
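For reference, a sketch of what Parquet-internal gzip compression looks like from Spark's write path (data and output path are illustrative):

    // Gzip is applied inside the Parquet file, per column chunk, so the
    // result stays splittable and readable as plain Parquet.
    val df = spark.range(10).toDF("n")
    df.write
      .option("compression", "gzip")
      .parquet("/tmp/out_parquet")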

Parquet Gzipped Files

2017-02-13 Thread Benjamin Kim
We are receiving files from an outside vendor who creates a Parquet data file and Gzips it before delivery. Does anyone know how to Gunzip the file in Spark and inject the Parquet data into a DataFrame? I thought using sc.textFile or sc.wholeTextFiles would automatically Gunzip the file, but

Re: is dataframe thread safe?

2017-02-13 Thread Mark Hamstra
If you update the data, then you don't have the same DataFrame anymore. If you don't do what Assaf did (caching and forcing evaluation of the DataFrame before using it concurrently), you'll still get consistent and correct results, but not necessarily efficient ones. If the
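A sketch of the cache-then-share pattern being described, for spark-shell (names and sizes illustrative):

    import scala.concurrent.{Await, Future}
    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.concurrent.duration._

    val df = spark.range(0, 1000000).toDF("id")
    df.cache()
    df.count()  // force evaluation once, before concurrent use

    // Multiple threads can now run independent actions on the same
    // (immutable) DataFrame without recomputing its lineage.
    val futures = Seq(
      Future(df.filter("id % 2 = 0").count()),
      Future(df.filter("id % 2 = 1").count())
    )
    val results = Await.result(Future.sequence(futures), 10.minutes)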

Re: is dataframe thread safe?

2017-02-13 Thread vincent gromakowski
How about having a thread that updates and caches a DataFrame in memory while other threads request this DataFrame; is that thread safe? 2017-02-13 9:02 GMT+01:00 Reynold Xin : > Yes your use case should be fine. Multiple threads can transform the same > data frame in

Re: Order of rows not preserved after cache + count + coalesce

2017-02-13 Thread Jon Gregg
Spark has a zipWithIndex function for RDDs (http://stackoverflow.com/a/26081548) that adds an index column right after you create an RDD, and I believe it preserves order. Then you can sort by the index after the cache step. I haven't tried this with a DataFrame, but this answer seems
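A sketch of that approach for spark-shell (row values are illustrative):

    // Attach a stable index while the order is still the one you want.
    val rdd = sc.parallelize(Seq("a", "b", "c"))
    val indexed = rdd.zipWithIndex()    // (value, index) pairs

    indexed.cache()
    indexed.count()

    // Recover the original order later by sorting on the index.
    val restored = indexed.sortBy(_._2).map(_._1)
    restored.collect()  // Array(a, b, c)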

[Spark Launcher] How to launch parallel jobs?

2017-02-13 Thread Cosmin Posteuca
Hi, I think I don't understand well enough how to launch jobs. I have one job which takes 60 seconds to finish. I run it with the following command:

spark-submit --executor-cores 1 \
  --executor-memory 1g \
  --driver-memory 1g \
  --master yarn \
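Since the subject mentions the launcher API: a sketch of submitting two applications in parallel programmatically with org.apache.spark.launcher.SparkLauncher (jar path, class name, and settings are hypothetical):

    import org.apache.spark.launcher.{SparkAppHandle, SparkLauncher}

    def launch(): SparkAppHandle =
      new SparkLauncher()
        .setAppResource("/path/to/app.jar")  // hypothetical jar
        .setMainClass("com.example.MyJob")   // hypothetical class
        .setMaster("yarn")
        .setConf(SparkLauncher.EXECUTOR_MEMORY, "1g")
        .setConf(SparkLauncher.EXECUTOR_CORES, "1")
        .startApplication()

    // Two independent YARN applications running concurrently,
    // provided the cluster has resources for both.
    val h1 = launch()
    val h2 = launch()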

Re: Lost executor 4 Container killed by YARN for exceeding memory limits.

2017-02-13 Thread Jon Gregg
Setting Spark's memoryOverhead configuration variable is recommended in your logs, and has helped me with these issues in the past. Search for "memoryOverhead" here: http://spark.apache.org/docs/latest/running-on-yarn.html That said, you're running on a huge cluster as it is. If it's possible

Does Spark support heavy duty third party libraries?

2017-02-13 Thread bhayes
I have a rather heavy-duty shared library which, among other options, can also be accessed via a Java/JNI wrapper JAR (the library itself is written in C++). This library needs up to 1000 external files, which in total can be larger than 50 GB in size. At init time all this data needs to be read
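A common pattern for heavy native dependencies (a general sketch, not from this thread): initialize the library once per executor JVM through a lazy singleton and call it inside mapPartitions, so the init cost is not paid per record. HeavyLib, its methods, and the native library name are all hypothetical:

    // Hypothetical JNI wrapper; the lazy val runs at most once per JVM.
    class HeavyLib(dataDir: String) {
      def process(s: String): String = s  // stand-in for the JNI call
    }
    object HeavyLib {
      lazy val instance: HeavyLib = {
        System.loadLibrary("heavylib")    // hypothetical native library
        new HeavyLib("/local/data/dir")   // reads its large data files here
      }
    }

    val out = sc.parallelize(Seq("a", "b")).mapPartitions { it =>
      val lib = HeavyLib.instance  // initialized once per executor JVM
      it.map(lib.process)
    }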

Re: How to measure IO time in Spark over S3

2017-02-13 Thread Steve Loughran
Hadoop 2.8's s3a collects a lot more metrics here, most of which you can find on HDP-2.5 if you can grab those JARs. Everything comes out as Hadoop JMX metrics, also readable and aggregatable through a call to FileSystem.getStorageStatistics. Measuring IO time isn't something that is picked up, because it's
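A sketch of pulling those counters from spark-shell (assuming Hadoop 2.8+ client JARs on the classpath; the bucket name is hypothetical, and the iterator API shown is an assumption about the Hadoop 2.8 StorageStatistics interface):

    import java.net.URI
    import org.apache.hadoop.fs.FileSystem

    val fs = FileSystem.get(new URI("s3a://my-bucket/"), sc.hadoopConfiguration)
    // Iterate the per-filesystem counters exposed in Hadoop 2.8+.
    val it = fs.getStorageStatistics.getLongStatistics
    while (it.hasNext) {
      val s = it.next()
      println(s"${s.getName} = ${s.getValue}")
    }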

Re: Order of rows not preserved after cache + count + coalesce

2017-02-13 Thread Nicholas Chammas
RDDs and DataFrames do not guarantee any specific ordering of data. They are like tables in a SQL database. The only way to get a guaranteed ordering of rows is to explicitly specify an orderBy() clause in your statement. Any ordering you see otherwise is incidental. On Mon, Feb 13, 2017 at

Order of rows not preserved after cache + count + coalesce

2017-02-13 Thread David Haglund (external)
Hi, I found something that surprised me; I expected the order of the rows to be preserved, so I suspect this might be a bug. The problem is illustrated with the Python example below:

In [1]: df = spark.createDataFrame([(i,) for i in range(3)], ['n'])
df.cache()
df.count()

Lost executor 4 Container killed by YARN for exceeding memory limits.

2017-02-13 Thread nancy henry
Hi all, I am getting the error below while I am trying to join 3 tables, which are in ORC format in Hive, from 5 x 10 GB tables, through the Hive context in Spark: Container killed by YARN for exceeding memory limits. 11.1 GB of 11 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.

Re: Remove dependence on HDFS

2017-02-13 Thread Calvin Jia
Hi Ben, You can replace HDFS with a number of storage systems since Spark is compatible with other storage like S3. This would allow you to scale your compute nodes solely for the purpose of adding compute power and not disk space. You can deploy Alluxio on your compute nodes to offset the

Re: Remove dependence on HDFS

2017-02-13 Thread Saisai Shao
IIUC, Spark doesn't strongly bind to HDFS; it uses a common FileSystem layer which supports different FS implementations, and HDFS is just one option. You could also use S3 as a backend FS; from Spark's point of view, the choice of FS implementation is transparent. On Sun, Feb 12, 2017 at 5:32 PM, ayan
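A sketch of that point in practice, for spark-shell (bucket and credential values are hypothetical placeholders; fs.s3a.* are the standard Hadoop s3a keys):

    // Same DataFrame API, different FileSystem implementation.
    sc.hadoopConfiguration.set("fs.s3a.access.key", "AKIA...")  // hypothetical
    sc.hadoopConfiguration.set("fs.s3a.secret.key", "secret")   // hypothetical

    val df = spark.read.parquet("s3a://my-bucket/data/")  // instead of hdfs://
    df.write.parquet("s3a://my-bucket/output/")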

Re: is dataframe thread safe?

2017-02-13 Thread Reynold Xin
Yes your use case should be fine. Multiple threads can transform the same data frame in parallel since they create different data frames. On Sun, Feb 12, 2017 at 9:07 AM Mendelson, Assaf wrote: > Hi, > > I was wondering if dataframe is considered thread safe. I know

Re: is dataframe thread safe?

2017-02-13 Thread 任弘迪
To my understanding, all transformations are thread-safe because a DataFrame is just a description of the calculation and it's immutable, so the case above is all right. Just be careful with the actions. On Sun, Feb 12, 2017 at 4:06 PM, Mendelson, Assaf wrote: > Hi, > > I