Re: PySpark 1.6.1: 'builtin_function_or_method' object has no attribute '__code__' in Pickles

2016-07-29 Thread Bhaarat Sharma
I'm very new to Spark. I'm running it on a single CentOS 7 box. How would I add a test.py to spark-submit? A pointer to any resources would be great. Thanks for your help. On Sat, Jul 30, 2016 at 1:28 AM, ayan guha wrote: > I think you need to add test.py in spark submit so that

Re: PySpark 1.6.1: 'builtin_function_or_method' object has no attribute '__code__' in Pickles

2016-07-29 Thread ayan guha
I think you need to add test.py in spark submit so that it gets shipped to all executors On Sat, Jul 30, 2016 at 3:24 PM, Bhaarat Sharma wrote: > I am using PySpark 1.6.1. In my python program I'm using ctypes and trying > to load the liblept library via the
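A minimal sketch of the two usual ways to ship a dependency like test.py to the executors; the file paths, job name, and the helper function inside test.py are illustrative, not taken from the thread:

```python
# Command-line route (paths are illustrative):
#   spark-submit --py-files /path/to/test.py my_job.py
#
# Programmatic route, assuming an existing SparkContext `sc`:
sc.addPyFile("/path/to/test.py")

def uses_helper(x):
    import test                     # the shipped module is importable on the executors
    return test.some_function(x)    # hypothetical function defined in test.py

result = sc.parallelize([1, 2, 3]).map(uses_helper).collect()
```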

Re: Java Recipes for Spark

2016-07-29 Thread ayan guha
Hi, is there anything similar for Python? Otherwise I can create one. On Sat, Jul 30, 2016 at 2:19 PM, Shiva Ramagopal wrote: > +1 for the Java love :-) > > On 30-Jul-2016 4:39 AM, "Renato Perini" wrote: > >> Not only very useful, but finally some Java

PySpark 1.6.1: 'builtin_function_or_method' object has no attribute '__code__' in Pickles

2016-07-29 Thread Bhaarat Sharma
I am using PySpark 1.6.1. In my python program I'm using ctypes and trying to load the liblept library via the liblept.so.4.0.2 file on my system. While trying to load the library via cdll.LoadLibrary("liblept.so.4.0.2") I get an error : 'builtin_function_or_method' object has no attribute
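The usual workaround for this class of pickling error is to open the native library inside the function that runs on the executors, so the ctypes handle itself is never serialized. A minimal sketch, with hypothetical input paths and an assumed existing SparkContext `sc`:

```python
from ctypes import cdll

def process_partition(paths):
    # Load the native library on the executor; ctypes handles are not
    # picklable, so they must not be created on the driver and captured
    # by the closure shipped to the workers.
    lept = cdll.LoadLibrary("liblept.so.4.0.2")
    for p in paths:
        yield (p, lept._name)   # placeholder work; real code would call into liblept

results = (sc.parallelize(["/data/img1.png", "/data/img2.png"])
             .mapPartitions(process_partition)
             .collect())
```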

Re: Java Recipes for Spark

2016-07-29 Thread Shiva Ramagopal
+1 for the Java love :-) On 30-Jul-2016 4:39 AM, "Renato Perini" wrote: > Not only very useful, but finally some Java love :-) > > Thank you. > > > On 29/07/2016 22:30, Jean Georges Perrin wrote: > >> Sorry if this looks like a shameless self promotion, but some of

sql to spark scala rdd

2016-07-29 Thread kali.tumm...@gmail.com
Hi All, I managed to write the business requirement in Spark SQL and Hive. I am still learning Scala; how would the SQL below be written using Spark RDDs rather than Spark DataFrames? SELECT DATE, balance, SUM(balance) OVER (ORDER BY DATE ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) daily_balance FROM table
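The question asks for Scala, but the structure of an RDD-only running total is the same in any Spark API: sort by date, compute per-partition sums, turn those into cumulative offsets, then accumulate within each partition. A rough PySpark sketch with made-up sample rows and an assumed existing SparkContext `sc`:

```python
# Sample (date, balance) rows.
rows = sc.parallelize([("2016-07-01", 100.0), ("2016-07-02", -40.0), ("2016-07-03", 25.0)])

sorted_rdd = rows.sortBy(lambda r: r[0])

# Per-partition totals, collected in partition order to build cumulative offsets.
part_sums = sorted_rdd.mapPartitions(lambda it: [sum(b for _, b in it)]).collect()
offsets = [0.0]
for s in part_sums[:-1]:
    offsets.append(offsets[-1] + s)

def running_total(idx, it):
    acc = offsets[idx]
    for date, balance in it:
        acc += balance
        yield (date, balance, acc)   # (DATE, balance, daily_balance)

daily_balance = sorted_rdd.mapPartitionsWithIndex(running_total)
```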

Re: Java Recipes for Spark

2016-07-29 Thread Renato Perini
Not only very useful, but finally some Java love :-) Thank you. On 29/07/2016 22:30, Jean Georges Perrin wrote: Sorry if this looks like a shameless self promotion, but some of you asked me to say when I'll have my Java recipes for Apache Spark updated. It's done here:

Spark 1.6.1 Workaround: Properly handle signal kill of ApplicationMaster

2016-07-29 Thread jatinder85
https://issues.apache.org/jira/browse/SPARK-13642 Does anybody know a reliable workaround for this issue in 1.6.1? Thanks, Jatinder

Re: Problems initializing SparkUI

2016-07-29 Thread Mich Talebzadeh
Why chance it? Best to explicitly specify in spark-submit (or whatever you use) which port to listen on: --conf "spark.ui.port=nnn" and see if it works. HTH, Dr Mich Talebzadeh
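A minimal sketch of pinning the UI port from the driver side; the port number and app name are arbitrary examples:

```python
from pyspark import SparkConf, SparkContext

# Pin the web UI to an explicit port instead of letting Spark probe
# upward from 4040 when the default is taken.
conf = SparkConf().setAppName("ui-port-example").set("spark.ui.port", "4050")
sc = SparkContext(conf=conf)

# Command-line equivalent:
#   spark-submit --conf "spark.ui.port=4050" my_job.py
```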

Re: Problems initializing SparkUI

2016-07-29 Thread Jacek Laskowski
Hi, I'm curious about "For some reason, sometimes the SparkUI does not appear to be bound on port 4040 (or any other) but the application runs perfectly and finishes giving the expected answer." How do you check that the web UI is listening on port 4040? Regards, Jacek Laskowski

Re: Java Recipes for Spark

2016-07-29 Thread Gavin Yue
This is useful:) Thank you for sharing. > On Jul 29, 2016, at 1:30 PM, Jean Georges Perrin wrote: > > Sorry if this looks like a shameless self promotion, but some of you asked me > to say when I'll have my Java recipes for Apache Spark updated. It's done > here:

Java Recipes for Spark

2016-07-29 Thread Jean Georges Perrin
Sorry if this looks like a shameless self promotion, but some of you asked me to say when I'll have my Java recipes for Apache Spark updated. It's done here: http://jgp.net/2016/07/22/spark-java-recipes/ and in the GitHub repo. Enjoy / have a

use big files and read from HDFS was: performance problem when reading lots of small files created by spark streaming.

2016-07-29 Thread Andy Davidson
Hi Pedro, I did some experiments using one of our relatively small data sets. The data set is loaded into 3 or 4 data frames, and I then call count(). It looks like using bigger files and reading from HDFS is a good solution for reading data. I guess I'll need to do something similar to this to deal
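One common way to get the "bigger files" effect is to compact the streaming output before the heavy reads. This is only a sketch of that idea, with hypothetical HDFS paths, an assumed existing `sqlContext`, and an arbitrary target partition count:

```python
# Read the many small files once, rewrite them as a handful of larger
# Parquet files, then point the downstream jobs at the compacted copy.
small = sqlContext.read.json("hdfs:///data/streaming_output/")
small.coalesce(16).write.parquet("hdfs:///data/compacted/")

compacted = sqlContext.read.parquet("hdfs:///data/compacted/")
print(compacted.count())
```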

Spark 1.6.1 Workaround: Properly handle signal kill of ApplicationMaster

2016-07-29 Thread Jatinder Assi
https://issues.apache.org/jira/browse/SPARK-13642 Does anybody know a reliable workaround for this issue in 1.6.1? Thanks, Jatinder

pyspark 1.6.1 `partitionBy` does not provide meaningful information for `join` to use

2016-07-29 Thread Sisyphuss
```python
import numpy as np

def id(x):
    return x

rdd = sc.parallelize(np.arange(1000))
rdd = rdd.map(lambda x: (x, 1))
rdd = rdd.partitionBy(8, id)
rdd = rdd.cache().setName('milestone')
rdd.join(rdd).collect()
```
The above code generates this DAG:

multiple SPARK_LOCAL_DIRS causing strange behavior in parallelism

2016-07-29 Thread Saif.A.Ellafi
Hi all, I have been playing around with spark-env and SPARK_LOCAL_DIRS in order to add additional shuffle storage. But since I did this, I am getting a "too many open files" error if the total executor cores is high. I am also getting low parallelism, by monitoring the running tasks on some

RE: HBase-Spark Module

2016-07-29 Thread David Newberger
Hi Ben, This seems more like a question for community.cloudera.com. However, it would be in hbase, not spark, I believe. https://repository.cloudera.com/artifactory/webapp/#/artifacts/browse/tree/General/cloudera-release-repo/org/apache/hbase/hbase-spark David Newberger

Re: The main difference use case between orderBY and sort

2016-07-29 Thread Mich Talebzadeh
Within the realm of ANSI SQL there is ORDER BY but no SORT BY. ORDER BY sorts the result set in ascending or descending order. In SQL, sorting is the term and ORDER BY is part of the syntax. In the map-reduce paradigm, for example in Hive QL, SORT BY sorts data per reducer. As I understand the
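On the DataFrame side the same distinction shows up as two different methods; a small sketch, reusing the column name from the related thread and assuming an existing DataFrame `df`:

```python
# Total ordering across the whole result set (ANSI ORDER BY):
df.orderBy("transactiondate").show(5)

# Ordering only within each partition, roughly what Hive's SORT BY gives
# per reducer (available on DataFrames since Spark 1.6):
df.sortWithinPartitions("transactiondate").show(5)
```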

HBase-Spark Module

2016-07-29 Thread Benjamin Kim
I would like to know if anyone has tried using the hbase-spark module? I tried to follow the examples in conjunction with CDH 5.8.0. I cannot find the HBaseTableCatalog class in the module or in any of the Spark jars. Can someone help? Thanks, Ben

Re: The main difference use case between orderBY and sort

2016-07-29 Thread Daniel Santana
As far as I know *sort* is just an alias of *orderBy* (or vice-versa) And your last operation is taking longer because you are sorting it twice. -- *Daniel Santana* Senior Software Engineer EVERY*MUNDO* 25 SE 2nd Ave., Suite 900 Miami, FL 33131 USA main:+1 (305) 375-0045 EveryMundo.com

Re: sampling operation for DStream

2016-07-29 Thread Cody Koeninger
With most stream systems you're still going to incur the cost of reading each message... I suppose you could rotate among reading just the latest messages from a single partition of a Kafka topic if they were evenly balanced. But once you've read the messages, nothing's stopping you from filtering

The main difference use case between orderBY and sort

2016-07-29 Thread Ashok Kumar
Hi, In Spark programing I can use df.filter(col("transactiontype") === "DEB").groupBy("transactiondate").agg(sum("debitamount").cast("Float").as("Total Debit Card")).orderBy("transactiondate").show(5) or df.filter(col("transactiontype") ===

Tuning level of Parallelism: Increase or decrease?

2016-07-29 Thread Jestin Ma
I am processing ~2 TB of HDFS data using DataFrames. The size of a task is equal to the block size specified by HDFS, which happens to be 128 MB, leading to about 15000 tasks. I'm using 5 worker nodes with 16 cores each and ~25 GB RAM. I'm performing groupBy, count, and an outer-join with another
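A common starting point, not a definitive recommendation: size the shuffle partitions against the total core count rather than inheriting the ~15000 map-side partitions. The node and core figures come from the thread; the multiplier, the assumed `sqlContext`, and the `joined_df` name are illustrative:

```python
# 5 workers x 16 cores = 80 cores; a frequent rule of thumb is 2-3
# shuffle tasks per core.
total_cores = 5 * 16
sqlContext.setConf("spark.sql.shuffle.partitions", str(total_cores * 3))

# Optionally bring the post-join DataFrame down to the same count so
# downstream stages do not inherit the HDFS-block-driven task count.
result = joined_df.repartition(total_cores * 3)
```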

sampling operation for DStream

2016-07-29 Thread Martin Le
Hi all, I have to handle a high-rate data stream. To reduce the heavy load, I want to use sampling techniques for each stream window. That means I want to process a subset of the data instead of the whole window. I saw that Spark supports sampling operations for RDDs, but for DStreams, Spark supports
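DStreams have no built-in sample operator, but transform() applies an arbitrary RDD operation to every micro-batch, so RDD.sample can be reused per window. A sketch with a hypothetical input stream and arbitrary fraction/seed:

```python
# Keep roughly 10% of each micro-batch.
sampled = input_stream.transform(
    lambda rdd: rdd.sample(withReplacement=False, fraction=0.1, seed=42))

sampled.count().pprint()   # per-batch size of the sampled subset
```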

Re: how to save spark files as parquets efficiently

2016-07-29 Thread Sumit Khanna
Great! Common sense is very uncommon. On Fri, Jul 29, 2016 at 8:26 PM, Ewan Leith wrote: > If you replace the df.write …. > > > > With > > > > df.count() > > > > in your code you’ll see how much time is taken to process the full > execution plan without the write

RE: how to save spark files as parquets efficiently

2016-07-29 Thread Ewan Leith
If you replace the df.write …. With df.count() in your code you’ll see how much time is taken to process the full execution plan without the write output. That code below looks perfectly normal for writing a parquet file yes, there shouldn’t be any tuning needed for “normal” performance.
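A sketch of that measurement, assuming an existing DataFrame `df` and an illustrative output path (note the second timing re-runs the same plan, so it covers both the transformations and the write):

```python
import time

t0 = time.time()
df.count()   # forces the full execution plan without writing any output
print("plan only: %.1f s" % (time.time() - t0))

t0 = time.time()
df.write.parquet("hdfs:///tmp/out.parquet")
print("plan + write: %.1f s" % (time.time() - t0))
```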

Re: correct / efficient manner to upsert / update in hdfs (via spark / in general)

2016-07-29 Thread ayan guha
Thanks Sumit, please post back how your test with HBase goes. On Fri, Jul 29, 2016 at 8:06 PM, Sumit Khanna wrote: > Hey Ayan, > > A. Create a table TGT1 as (select key,info from delta UNION ALL select > key,info from TGT where key not in (select key from SRC)). Rename

Re: tpcds for spark2.0

2016-07-29 Thread Olivier Girardot
I have the same kind of issue (not using spark-sql-perf), just trying to deploy 2.0.0 on mesos. I'll keep you posted as I investigate On Wed, Jul 27, 2016 1:06 PM, kevin kiss.kevin...@gmail.com wrote: hi,all: I want to have a test about tpcds99 sql run on spark2.0. I user

Re: estimation of necessary time of execution

2016-07-29 Thread Mich Talebzadeh
Hi, what is that function in Hive, as a matter of interest? Thanks, Dr Mich Talebzadeh

Re: how to save spark files as parquets efficiently

2016-07-29 Thread Sumit Khanna
Hey Gourav, Well, I think it is my execution plan that is at fault. Basically df.write, shown as a Spark job on localhost:4040, being an action, will include the time taken by all the upstream transformations, right? All I wanted to know is "what apt env/config params are needed to

estimation of necessary time of execution

2016-07-29 Thread pseudo oduesp
Hi, in Hive we have an awesome function for estimating the execution time before launch. In Spark, can I find any function to estimate the execution time of a Spark DAG's lineage? Thanks
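Spark has no built-in wall-clock estimator comparable to what is described for Hive; the closest thing is inspecting the plan or the lineage before triggering execution. A small sketch, assuming an existing DataFrame `df` and RDD `rdd`:

```python
# Logical and physical plan for a DataFrame (no cost or time estimate,
# but it shows what will run before anything is executed):
df.explain(True)

# Lineage of a plain RDD:
print(rdd.toDebugString())
```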

Re: Spark 2.0 Build Failed

2016-07-29 Thread Ascot Moss
I think my maven is broken, I used another node in the cluster to compile 2.0.0 and got "successful" [INFO] [INFO] --- maven-source-plugin:2.4:jar-no-fork (create-source-jar) @ java8-tests_2.11 --- [INFO] [INFO] --- maven-source-plugin:2.4:test-jar-no-fork (create-source-jar) @

Re: how to save spark files as parquets efficiently

2016-07-29 Thread Gourav Sengupta
Hi, The default write format in SPARK is parquet, and I have never faced any issues writing over a billion records in SPARK. Are you using virtualization by any chance, or an obsolete hard disk, or maybe an Intel Celeron? Regards, Gourav Sengupta On Fri, Jul 29, 2016 at 7:27 AM, Sumit Khanna

Re: how to save spark files as parquets efficiently

2016-07-29 Thread Sumit Khanna
Hey, So I believe this is the right format to save the file; the optimization is never in the write part but in the head/body of my execution plan, isn't it? Thanks, On Fri, Jul 29, 2016 at 11:57 AM, Sumit Khanna wrote: > Hey, > > master=yarn > mode=cluster > >

Re: correct / efficient manner to upsert / update in hdfs (via spark / in general)

2016-07-29 Thread Sumit Khanna
Hey Ayan, A. Create a table TGT1 as (select key,info from delta UNION ALL select key,info from TGT where key not in (select key from SRC)). Rename TGT1 to TGT. NOT IN can be written in other variations using an outer join. B. Assuming SRC and TGT have a timestamp, B.1. Select latest records

Re: correct / efficient manner to upsert / update in hdfs (via spark / in general)

2016-07-29 Thread ayan guha
This is a classic case compared to a hadoop vs DWH implementation. Source (Delta table): SRC. Target: TGT. Requirement: Pure upsert, i.e. just keep the latest information for each key. Options: A. Create a table TGT1 as (select key,info from delta UNION ALL select key,info from TGT where key not in
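A DataFrame rendering of option A, sketched for Spark 1.6 (which has no left_anti join type); the column names, paths, and the final overwrite-then-swap step are illustrative, not taken from the thread:

```python
from pyspark.sql.functions import col

# Target rows whose key is absent from the delta...
delta_keys = delta_df.select(col("key").alias("d_key")).distinct()

kept = (target_df
        .join(delta_keys, target_df["key"] == delta_keys["d_key"], "left_outer")
        .where(col("d_key").isNull())
        .drop("d_key"))

# ...plus every row from the delta: the upserted target.
upserted = delta_df.unionAll(kept)

# Write to a new location, then swap/rename, mirroring "Rename TGT1 to TGT".
upserted.write.mode("overwrite").parquet("hdfs:///warehouse/tgt_new/")
```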

Re: correct / efficient manner to upsert / update in hdfs (via spark / in general)

2016-07-29 Thread Sumit Khanna
Just a note: I had the delta_df keys for the NOT-INTERSECTION filter udf broadcast to all the worker nodes, which I think is an efficient enough move. Thanks, On Fri, Jul 29, 2016 at 12:19 PM, Sumit Khanna wrote: > Hey, > > the very first run : > > glossary : > >

correct / efficient manner to upsert / update in hdfs (via spark / in general)

2016-07-29 Thread Sumit Khanna
Hey, the very first run : glossary : delta_df := current run / execution changes dataframe. def deduplicate : apply windowing function and group by def partitionDataframe(delta_df) : get unique keys of that data frame and then return an array of data frames each containing just that very same

how to save spark files as parquets efficiently

2016-07-29 Thread Sumit Khanna
Hey, master=yarn mode=cluster spark.executor.memory=8g spark.rpc.netty.dispatcher.numThreads=2 All the POC is on a single-node cluster. The biggest bottleneck: 1.8 hrs to save 500k records as a parquet file/dir executing this command:

Re: Spark 2.0 -- spark warehouse relative path in absolute URI error

2016-07-29 Thread Tony Lane
I am facing the same issue and am completely blocked here. *Sean, can you please help with this issue?* Migrating to 2.0.0 has really stalled our development effort. -Tony > -- Forwarded message -- > From: Sean Owen > Date: Fri, Jul 29, 2016 at 12:47 AM >
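The workaround most often reported for this error is to give the session an explicit absolute URI for the warehouse directory instead of relying on the default relative "spark-warehouse" path; a minimal sketch with an example path:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("warehouse-workaround")
         .config("spark.sql.warehouse.dir", "file:///tmp/spark-warehouse")
         .getOrCreate())
```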