Re: What is the best way to migrate existing scikit-learn code to PySpark?

2015-09-13 Thread Nick Pentreath
I should point out that I'm not sure what the performance of that project is. I'd expect that native DataFrames in PySpark will be significantly more efficient than their DictRDD. It would be interesting to see a performance comparison of the pipelines relative to native Spark ML

Re: Data lost in spark streaming

2015-09-13 Thread Tathagata Das
Maybe the driver got restarted. See the log4j logs of the driver from before it restarted. On Thu, Sep 10, 2015 at 11:32 PM, Bin Wang wrote: > I'm using spark streaming 1.4.0 and have a DStream that have all the data > it received. But today the history data in the DStream seems to

How to use a Hive UDF in a Spark DataFrame?

2015-09-13 Thread unk1102
Hi, I am using a UDF in a hiveContext.sql("") query; inside, it uses GROUP BY, which forces a huge shuffle read of around 30 GB. I am thinking of converting the above query to the DataFrame API so that I avoid the GROUP BY. How do we use a Hive UDF in a Spark DataFrame? Please guide. Thanks much. -- View this

Re: selecting columns with the same name in a join

2015-09-13 Thread Evert Lammerts
Thanks Michael, we'll update then. Evert On Sep 11, 2015 20:59, "Michael Armbrust" wrote: > Here is what I get on branch-1.5: > > x = sc.parallelize([dict(k=1, v="Evert"), dict(k=2, v="Erik")]).toDF() > y = sc.parallelize([dict(k=1, v="Ruud"), dict(k=3,

Re: [Question] ORC - EMRFS Problem

2015-09-13 Thread Cazen Lee
Hi Owen, thank you for the reply. I heard that some people say ORC is "Owen's RC file", haha ;) Also, after posting, some people told me it's a known issue with AWS EMR 4.0.0. They said it might be a Hive 0.13.1 and Spark 1.4.1 compatibility issue, so AWS will launch EMR 4.1.0 in a couple

Stopping SparkContext and HiveContext

2015-09-13 Thread Ophir Cohen
Hi, I'm working on my company's system, which is built out of Spark, Zeppelin, Hive and some other technologies, and I wonder about the ability to stop contexts. Working on the test framework for the system, when running tests I would like to create a new SparkContext in order to run the tests

Re: [Question] ORC - EMRFS Problem

2015-09-13 Thread Owen O'Malley
Do you have a stack trace of the array out of bounds exception? I don't remember an array out of bounds problem off the top of my head. A stack trace will tell me a lot, obviously. If you are using Spark 1.4 that implies Hive 0.13, which is pretty old. It may be a problem that we fixed a while

Re: [Question] ORC - EMRFS Problem

2015-09-13 Thread Cazen
The stack trace is below. But someone told me that it's a known issue and will be patched in a couple of weeks (EMR 4.1). So, don't worry about that. I'll wait until it's patched. scala> val ORCFile =

Re: What happens when cache is full?

2015-09-13 Thread Mailinglisten
Hello Jeff, I'm quite new to the Spark topic, but as far as I understand caching, Spark uses the available memory, and if more memory is requested, cached RDDs are thrown away in an LRU manner and recomputed when needed. Please correct me if I'm wrong. Regards, Lars > On 13.09.2015 at
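That is indeed the documented behavior: Spark's block manager drops the least-recently-used cached blocks when new blocks need room, and a dropped RDD partition is recomputed from its lineage on the next access. A minimal pure-Python sketch of the LRU bookkeeping (a toy model of the idea, not Spark's actual MemoryStore; block names and sizes are illustrative):

```python
from collections import OrderedDict

class LruBlockStore:
    """Toy model of LRU block eviction, as in Spark's in-memory cache."""
    def __init__(self, capacity_mb):
        self.capacity_mb = capacity_mb
        self.used_mb = 0
        self.blocks = OrderedDict()  # block_id -> size_mb, oldest first

    def get(self, block_id):
        # A cache hit makes the block most-recently-used.
        if block_id in self.blocks:
            self.blocks.move_to_end(block_id)
            return block_id
        return None  # miss: the caller must recompute from lineage

    def put(self, block_id, size_mb):
        # Evict least-recently-used blocks until the new one fits.
        evicted = []
        while self.used_mb + size_mb > self.capacity_mb and self.blocks:
            old_id, old_size = self.blocks.popitem(last=False)
            self.used_mb -= old_size
            evicted.append(old_id)
        self.blocks[block_id] = size_mb
        self.used_mb += size_mb
        return evicted

store = LruBlockStore(capacity_mb=100)
store.put("rdd_1_0", 60)
store.put("rdd_1_1", 30)
store.get("rdd_1_0")              # touch: rdd_1_1 is now least recent
print(store.put("rdd_2_0", 40))   # evicts rdd_1_1, not rdd_1_0
```

The key point the sketch illustrates is that a `get` refreshes recency, so recently-read cached partitions survive while stale ones are dropped first.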

Re: UDAF and UDT with SparkSQL 1.5.0

2015-09-13 Thread jussipekkap
A small update: I was able to find a solution with good performance, using brickhouse collect (a Hive UDAF). This also accepts structs as input, which is an OK workaround, but still not perfect (support for UDTs would be better). The built-in Hive 'collect_list' seems to have a check for input

Re: Data lost in spark streaming

2015-09-13 Thread Bin Wang
There are some error logs in the executor, and I don't know if they are related: 15/09/11 10:54:05 WARN ipc.Client: Exception encountered while connecting to the server : org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): Invalid AMRMToken from

Re: Stopping SparkContext and HiveContext

2015-09-13 Thread Ted Yu
For #1, there is the following method: @DeveloperApi def getExecutorStorageStatus: Array[StorageStatus] = { assertNotStopped() You can wrap the call in a try block catching IllegalStateException. Of course, this is just a workaround. FYI On Sun, Sep 13, 2015 at 1:48 AM, Ophir Cohen

Re: Data lost in spark streaming

2015-09-13 Thread Ted Yu
Can you retrieve the log for appattempt_1440495451668_0258_01 and see if there is some clue there? Cheers On Sun, Sep 13, 2015 at 3:28 AM, Bin Wang wrote: > There is some error logs in the executor and I don't know if it is related: > > 15/09/11 10:54:05 WARN ipc.Client:

Re: change the spark version

2015-09-13 Thread Steve Loughran
> On 12 Sep 2015, at 09:14, Sean Owen wrote: > > This is a question for the CDH list. CDH 5.4 has Spark 1.3, and 5.5 > has 1.5. The best thing is to update CDH as a whole if you can. > > However it's pretty simple to just run a newer Spark assembly as a > YARN app. Don't

Parquet partitioning performance issue

2015-09-13 Thread sonal sharma
Hi Team, We have scheduled jobs that read new records from a MySQL database every hour and write (append) them to Parquet. For each append operation, Spark creates 10 new partitions in the Parquet file. Some of these partitions are fairly small (20-40 KB), leading to a high number of smaller

Re: Parquet partitioning performance issue

2015-09-13 Thread Dean Wampler
One general technique is to perform a second pass later over the files, for example the next day or once a week, to concatenate smaller files into larger ones. This can be done for all file types and allows you to make recent data available to analysis tools, while avoiding a large build-up of small
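The second-pass idea can be sketched in plain Python: group the small files into batches of roughly a target size, then rewrite each batch as one larger file (e.g. a nightly Spark job doing read + coalesce(1) + write per batch). The greedy first-fit grouping below is a sketch; the file names and the target size are illustrative, not from the original post:

```python
def plan_compaction(files, target_bytes):
    """Greedily group (name, size) pairs into batches of about target_bytes.

    Each batch would then be read and rewritten as a single larger file
    by the periodic compaction job.
    """
    batches, current, current_size = [], [], 0
    for name, size in sorted(files, key=lambda f: f[1], reverse=True):
        if current and current_size + size > target_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        batches.append(current)
    return batches

kb = 1024
files = [("part-%05d" % i, 30 * kb) for i in range(10)]  # ten 30 KB files
print(plan_compaction(files, target_bytes=100 * kb))     # four batches: 3+3+3+1 files
```

In practice the target would be something like the HDFS block size, so downstream readers see a few block-sized files instead of hundreds of kilobyte-sized ones.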

Re: RDD transformation and action running out of memory

2015-09-13 Thread Utkarsh Sengar
Yup, that was the problem. Changing the default "mongo.input.split_size" from 8MB to 100MB did the trick. Config reference: https://github.com/mongodb/mongo-hadoop/wiki/Configuration-Reference Thanks! On Sat, Sep 12, 2015 at 3:15 PM, Richard Eggert wrote: > Hmm...
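The effect is easy to see with back-of-the-envelope arithmetic: the number of input splits (and hence Spark partitions and tasks) is roughly the collection size divided by mongo.input.split_size, so raising the split size from 8 MB to 100 MB cuts the partition count by more than 12x. A quick illustration (the 10 GB collection size is hypothetical):

```python
import math

def num_splits(collection_mb, split_size_mb):
    # One input split per split_size_mb chunk of the collection.
    return math.ceil(collection_mb / split_size_mb)

collection_mb = 10 * 1024              # hypothetical 10 GB collection
print(num_splits(collection_mb, 8))    # default 8 MB splits -> 1280
print(num_splits(collection_mb, 100))  # 100 MB splits -> 103
```

Fewer, larger splits mean fewer tasks and less per-task overhead, which is why the change helped with the out-of-memory behavior discussed upthread.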

Re: CREATE TABLE ignores database when using PARQUET option

2015-09-13 Thread hbogert
I'm having the same problem, did you solve this? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/CREATE-TABLE-ignores-database-when-using-PARQUET-option-tp22824p24679.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Best way to merge final output part files created by Spark job

2015-09-13 Thread unk1102
Hi, I have a Spark job which creates around 500 part files inside each directory I process, and I have thousands of such directories, so I need to merge these 500 small part files. I am using spark.sql.shuffle.partitions as 500, and my final small files are ORC files. Is there a way to merge ORC
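One common approach (not from this thread) is to re-read each directory and shrink the partition count before rewriting, e.g. df.coalesce(n).write.orc(...) in Spark, choosing n from the total directory size and a desired output file size. A small pure-Python sketch of choosing n (the sizes are hypothetical):

```python
import math

def coalesce_partitions(total_bytes, target_file_bytes=256 * 1024 * 1024):
    """How many partitions to coalesce to so each output file is ~target size."""
    return max(1, math.ceil(total_bytes / target_file_bytes))

# 500 part files of ~1 MB each -> ~500 MB per directory (hypothetical sizes)
total = 500 * 1024 * 1024
print(coalesce_partitions(total))  # with a 256 MB target: 2 output files
```

coalesce avoids a full shuffle (unlike repartition), which makes it the cheaper choice when only reducing the partition count.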

Re: spark 1.5 SQL slows down dramatically by 50%+ compared with spark 1.4.1 SQL

2015-09-13 Thread Todd
Thanks Davies for the explanation. When I turn off the following options, I still see that Spark 1.5 is much slower than 1.4.1. I am thinking about how I can configure things so that Spark 1.5 has performance similar to Spark 1.4 for this particular query: --conf spark.sql.planner.sortMergeJoin=false
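For reference, flags like the one mentioned are passed on the spark-submit command line; a sketch of the invocation (the job file name is a placeholder, and the other flags from the truncated list are not reproduced here):

```shell
spark-submit \
  --conf spark.sql.planner.sortMergeJoin=false \
  your_job.py
```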

Re: Stopping SparkContext and HiveContext

2015-09-13 Thread Ted Yu
Please also see this thread: http://search-hadoop.com/m/q3RTtGpLeLyv97B1 On Sun, Sep 13, 2015 at 9:49 AM, Ted Yu wrote: > For #1, there is the following method: > > @DeveloperApi > def getExecutorStorageStatus: Array[StorageStatus] = { > assertNotStopped() > > You

Re: Problem to persist Hibernate entity from Spark job

2015-09-13 Thread Zoran Jeremic
Hi guys, I'm still trying to solve the issue with saving Hibernate entities from Spark. After several attempts to redesign my own code, I ended up with a HelloWorld example which clearly demonstrates that the problem is not the complexity of my code or session mixing across threads. The code given

Replacing Esper with Spark Streaming?

2015-09-13 Thread Otis Gospodnetić
Hi, I'm wondering if anyone has attempted to replace Esper with Spark Streaming, or if anyone thinks Spark Streaming is/isn't a good tool for the (CEP) job. We are considering Akka or Spark Streaming as possible Esper replacements and would appreciate any input from people who have tried to do that

Re: Training the MultilayerPerceptronClassifier

2015-09-13 Thread Feynman Liang
AFAIK no; we have a TODO item to implement more rigorous correctness tests (e.g. referenced against Weka). If you're interested, go ahead and comment on the