Re: What is the best way to migrate existing scikit-learn code to PySpark?

2015-09-13 Thread Nick Pentreath
I should point out that I'm not sure what the performance of that project is. I'd expect that native DataFrames in PySpark will be significantly more efficient than their DictRDD. It would be interesting to see a performance comparison of the pipelines relative to native Spark ML

Re: Data lost in spark streaming

2015-09-13 Thread Tathagata Das
Maybe the driver got restarted. See the log4j logs of the driver from before it restarted. On Thu, Sep 10, 2015 at 11:32 PM, Bin Wang wrote: > I'm using spark streaming 1.4.0 and have a DStream that have all the data > it received. But today the history data in the DStream seems to

How to use a Hive UDF in a Spark DataFrame?

2015-09-13 Thread unk1102
Hi, I am using a UDF in a hiveContext.sql("") query; inside, it uses GROUP BY, which forces a huge shuffle read of around 30 GB. I am thinking of converting the above query to the DataFrame API so that I avoid the GROUP BY. How do we use a Hive UDF in a Spark DataFrame? Please guide. Thanks much. -- View this

Re: selecting columns with the same name in a join

2015-09-13 Thread Evert Lammerts
Thanks Michael, we'll update then. Evert On Sep 11, 2015 20:59, "Michael Armbrust" wrote: > Here is what I get on branch-1.5: > > x = sc.parallelize([dict(k=1, v="Evert"), dict(k=2, v="Erik")]).toDF() > y = sc.parallelize([dict(k=1, v="Ruud"), dict(k=3,

Re: [Question] ORC - EMRFS Problem

2015-09-13 Thread Cazen Lee
Hi Owen, thank you for the reply. I heard that some people say ORC is "Owen's RC file", haha ;) Also, after posting, some people told me it's a known issue with AWS EMR 4.0.0. They said it might be a Hive 0.13.1 and Spark 1.4.1 compatibility issue, so AWS will launch EMR 4.1.0 in a couple

Stopping SparkContext and HiveContext

2015-09-13 Thread Ophir Cohen
Hi, I'm working on my company's system, which is built out of Spark, Zeppelin, Hive and some other technologies, and I wonder about the ability to stop contexts. Working on the test framework for the system, when running tests I would like to create a new SparkContext in order to run the tests

Re: [Question] ORC - EMRFS Problem

2015-09-13 Thread Owen O'Malley
Do you have a stack trace of the array out of bounds exception? I don't remember an array out of bounds problem off the top of my head. A stack trace will tell me a lot, obviously. If you are using Spark 1.4 that implies Hive 0.13, which is pretty old. It may be a problem that we fixed a while

Re: [Question] ORC - EMRFS Problem

2015-09-13 Thread Cazen
The stack trace is below. But someone told me that it's a known issue and will be patched in a couple of weeks (EMR 4.1). So, don't worry about that. I'll wait until it's patched. scala> val ORCFile =

Re: What happens when cache is full?

2015-09-13 Thread Mailinglisten
Hello Jeff, I'm quite new to the Spark topic, but as far as I understand caching, Spark uses the available memory, and if more memory is requested, cached RDDs are thrown away in an LRU manner and recomputed when needed. Please correct me if I'm wrong. Regards, Lars > On 13.09.2015 at
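That is indeed the documented behavior: Spark's block manager drops the least-recently-used cached blocks when new blocks need room, and a dropped RDD partition is recomputed from its lineage on the next access. A minimal pure-Python sketch of the LRU bookkeeping (a toy model of the idea, not Spark's actual MemoryStore; block names and sizes are illustrative):

```python
from collections import OrderedDict

class LruBlockStore:
    """Toy model of LRU block eviction, as in Spark's in-memory cache."""
    def __init__(self, capacity_mb):
        self.capacity_mb = capacity_mb
        self.used_mb = 0
        self.blocks = OrderedDict()  # block_id -> size_mb, oldest first

    def get(self, block_id):
        # A cache hit makes the block most-recently-used.
        if block_id in self.blocks:
            self.blocks.move_to_end(block_id)
            return block_id
        return None  # miss: the caller must recompute from lineage

    def put(self, block_id, size_mb):
        # Evict least-recently-used blocks until the new one fits.
        evicted = []
        while self.used_mb + size_mb > self.capacity_mb and self.blocks:
            old_id, old_size = self.blocks.popitem(last=False)
            self.used_mb -= old_size
            evicted.append(old_id)
        self.blocks[block_id] = size_mb
        self.used_mb += size_mb
        return evicted

store = LruBlockStore(capacity_mb=100)
store.put("rdd_1_0", 60)
store.put("rdd_1_1", 30)
store.get("rdd_1_0")              # touch: rdd_1_1 is now least recent
print(store.put("rdd_2_0", 40))   # evicts rdd_1_1, not rdd_1_0
```

The key point the sketch illustrates is that a `get` refreshes recency, so recently-read cached partitions survive while stale ones are dropped first.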

Re: UDAF and UDT with SparkSQL 1.5.0

2015-09-13 Thread jussipekkap
A small update: I was able to find a solution with good performance, using brickhouse collect (a Hive UDAF). This also accepts structs as input, which is an OK workaround, but still not perfect (support for UDTs would be better). The built-in Hive 'collect_list' seems to have a check for input

Re: Data lost in spark streaming

2015-09-13 Thread Bin Wang
There are some error logs in the executor, and I don't know if they are related: 15/09/11 10:54:05 WARN ipc.Client: Exception encountered while connecting to the server : org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): Invalid AMRMToken from

Re: Stopping SparkContext and HiveContext

2015-09-13 Thread Ted Yu
For #1, there is the following method: @DeveloperApi def getExecutorStorageStatus: Array[StorageStatus] = { assertNotStopped() You can wrap the call in a try block catching IllegalStateException. Of course, this is just a workaround. FYI On Sun, Sep 13, 2015 at 1:48 AM, Ophir Cohen

Re: Data lost in spark streaming

2015-09-13 Thread Ted Yu
Can you retrieve the log for appattempt_1440495451668_0258_01 and see if there is some clue there? Cheers On Sun, Sep 13, 2015 at 3:28 AM, Bin Wang wrote: > There is some error logs in the executor and I don't know if it is related: > > 15/09/11 10:54:05 WARN ipc.Client:

Re: change the spark version

2015-09-13 Thread Steve Loughran
> On 12 Sep 2015, at 09:14, Sean Owen wrote: > > This is a question for the CDH list. CDH 5.4 has Spark 1.3, and 5.5 > has 1.5. The best thing is to update CDH as a whole if you can. > > However it's pretty simple to just run a newer Spark assembly as a > YARN app. Don't

Parquet partitioning performance issue

2015-09-13 Thread sonal sharma
Hi Team, We have scheduled jobs that read new records from a MySQL database every hour and write (append) them to Parquet. For each append operation, Spark creates 10 new partitions in the Parquet file. Some of these partitions are fairly small (20-40 KB), leading to a high number of smaller

Re: Parquet partitioning performance issue

2015-09-13 Thread Dean Wampler
One general technique is to perform a second pass later over the files, for example the next day or once a week, to concatenate smaller files into larger ones. This can be done for all file types and allows you to make recent data available to analysis tools, while avoiding a large build-up of small
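The second-pass idea can be sketched in plain Python: group the small files into batches of roughly a target size, then rewrite each batch as one larger file (e.g. a nightly Spark job doing read + coalesce(1) + write per batch). The greedy first-fit grouping below is a sketch; the file names and the target size are illustrative, not from the original post:

```python
def plan_compaction(files, target_bytes):
    """Greedily group (name, size) pairs into batches of about target_bytes.

    Each batch would then be read and rewritten as a single larger file
    by the periodic compaction job.
    """
    batches, current, current_size = [], [], 0
    for name, size in sorted(files, key=lambda f: f[1], reverse=True):
        if current and current_size + size > target_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        batches.append(current)
    return batches

kb = 1024
files = [("part-%05d" % i, 30 * kb) for i in range(10)]  # ten 30 KB files
print(plan_compaction(files, target_bytes=100 * kb))     # four batches: 3+3+3+1 files
```

In practice the target would be something like the HDFS block size, so downstream readers see a few block-sized files instead of hundreds of kilobyte-sized ones.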

Re: RDD transformation and action running out of memory

2015-09-13 Thread Utkarsh Sengar
Yup, that was the problem. Changing the default "mongo.input.split_size" from 8MB to 100MB did the trick. Config reference: https://github.com/mongodb/mongo-hadoop/wiki/Configuration-Reference Thanks! On Sat, Sep 12, 2015 at 3:15 PM, Richard Eggert wrote: > Hmm...
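The effect is easy to see with back-of-the-envelope arithmetic: the number of input splits (and hence Spark partitions and tasks) is roughly the collection size divided by mongo.input.split_size, so raising the split size from 8 MB to 100 MB cuts the partition count by more than 12x. A quick illustration (the 10 GB collection size is hypothetical):

```python
import math

def num_splits(collection_mb, split_size_mb):
    # One input split per split_size_mb chunk of the collection.
    return math.ceil(collection_mb / split_size_mb)

collection_mb = 10 * 1024              # hypothetical 10 GB collection
print(num_splits(collection_mb, 8))    # default 8 MB splits -> 1280
print(num_splits(collection_mb, 100))  # 100 MB splits -> 103
```

Fewer, larger splits mean fewer tasks and less per-task overhead, which is why the change helped with the out-of-memory behavior discussed upthread.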

Re: CREATE TABLE ignores database when using PARQUET option

2015-09-13 Thread hbogert
I'm having the same problem, did you solve this? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/CREATE-TABLE-ignores-database-when-using-PARQUET-option-tp22824p24679.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Best way to merge final output part files created by Spark job

2015-09-13 Thread unk1102
Hi, I have a Spark job which creates around 500 part files inside each directory I process, and I have thousands of such directories, so I need to merge these 500 small part files. I am using spark.sql.shuffle.partitions as 500, and my final small files are ORC files. Is there a way to merge ORC
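One common approach (not from this thread) is to re-read each directory and shrink the partition count before rewriting, e.g. df.coalesce(n).write.orc(...) in Spark, choosing n from the total directory size and a desired output file size. A small pure-Python sketch of choosing n (the sizes are hypothetical):

```python
import math

def coalesce_partitions(total_bytes, target_file_bytes=256 * 1024 * 1024):
    """How many partitions to coalesce to so each output file is ~target size."""
    return max(1, math.ceil(total_bytes / target_file_bytes))

# 500 part files of ~1 MB each -> ~500 MB per directory (hypothetical sizes)
total = 500 * 1024 * 1024
print(coalesce_partitions(total))  # with a 256 MB target: 2 output files
```

coalesce avoids a full shuffle (unlike repartition), which makes it the cheaper choice when only reducing the partition count.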

Re: spark 1.5 SQL slows down dramatically by 50%+ compared with spark 1.4.1 SQL

2015-09-13 Thread Todd
Thanks Davies for the explanation. When I turn off the following options, I still see that Spark 1.5 is much slower than 1.4.1. I am thinking about how I can configure things so that Spark 1.5 has performance similar to Spark 1.4 for this particular query: --conf spark.sql.planner.sortMergeJoin=false
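For reference, flags like the one mentioned are passed on the spark-submit command line; a sketch of the invocation (the job file name is a placeholder, and the other flags from the truncated list are not reproduced here):

```shell
spark-submit \
  --conf spark.sql.planner.sortMergeJoin=false \
  your_job.py
```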

Re: Stopping SparkContext and HiveContext

2015-09-13 Thread Ted Yu
Please also see this thread: http://search-hadoop.com/m/q3RTtGpLeLyv97B1 On Sun, Sep 13, 2015 at 9:49 AM, Ted Yu wrote: > For #1, there is the following method: > > @DeveloperApi > def getExecutorStorageStatus: Array[StorageStatus] = { > assertNotStopped() > > You

Re: Problem to persist Hibernate entity from Spark job

2015-09-13 Thread Zoran Jeremic
Hi guys, I'm still trying to solve the issue with saving Hibernate entities from Spark. After several attempts to redesign my own code, I ended up with a HelloWorld example which clearly demonstrates that the problem is not the complexity of my code or session mixing across threads. The code given

Replacing Esper with Spark Streaming?

2015-09-13 Thread Otis Gospodnetić
Hi, I'm wondering if anyone has attempted to replace Esper with Spark Streaming, or if anyone thinks Spark Streaming is/isn't a good tool for the (CEP) job. We are considering Akka or Spark Streaming as possible Esper replacements and would appreciate any input from people who have tried to do that

Re: Training the MultilayerPerceptronClassifier

2015-09-13 Thread Feynman Liang
AFAIK no; we have a TODO item to implement more rigorous correctness tests (e.g. referenced against Weka). If you're interested, go ahead and comment on the