Re: spark source Intellij

2016-01-15 Thread Ted Yu
See: http://search-hadoop.com/m/q3RTtZbuxxp9p6N1=Re+Best+IDE+Configuration > On Jan 15, 2016, at 2:19 AM, Sanjeev Verma wrote: > > I want to configure spark source code into Intellij IDEA Is there any > document available / known steps which can guide me to

Re: NPE when using Joda DateTime

2016-01-15 Thread Romain Sagean
Hi, I had a similar problem with Joda Time though I didn't use Kryo; the solution I found was to use the standard Java date and time classes instead of Joda. 2016-01-15 13:16 GMT+01:00 Sean Owen : > I haven't dug into this, but I agree that something that is transient > isn't
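A minimal sketch of that workaround (the class and field names here are assumptions, not from the thread): carry the timestamp as a plain epoch-millis Long so nothing Joda-specific is ever serialized, and rebuild a date object locally only where one is needed:

  case class Txn(id: String, dateMillis: Long, amount: Double)

  val txns = sc.parallelize(Seq(Txn("a", System.currentTimeMillis(), 10.0)))

  val amountByHour = txns.map { t =>
    val cal = java.util.Calendar.getInstance()
    cal.setTimeInMillis(t.dateMillis)                 // standard Java date handling
    (cal.get(java.util.Calendar.HOUR_OF_DAY), t.amount)
  }.reduceByKey(_ + _)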

Re: AIC in Linear Regression in ml pipeline

2016-01-15 Thread Yanbo Liang
Hi Arunkumar, Outputting the AIC value for Linear Regression is not currently supported. This feature is under development and will be released in Spark 2.0. Thanks Yanbo 2016-01-15 17:20 GMT+08:00 Arunkumar Pillai : > Hi > > Is it possible to get AIC value in Linear
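Until that lands, a rough interim workaround (a sketch, not an official API) is to compute AIC by hand from the residual sum of squares; for a Gaussian model, AIC = n*ln(RSS/n) + 2k up to an additive constant. This assumes a SQLContext in scope and a DataFrame `training` with "features" and "label" columns:

  import org.apache.spark.ml.regression.LinearRegression
  import sqlContext.implicits._

  val model = new LinearRegression().fit(training)
  val pred  = model.transform(training)

  val rss = pred.select(($"label" - $"prediction") * ($"label" - $"prediction"))
                .rdd.map(_.getDouble(0)).sum()
  val n   = pred.count()
  val k   = model.coefficients.size + 1               // parameters incl. intercept
  val aic = n * math.log(rss / n) + 2.0 * k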

Re: NPE when using Joda DateTime

2016-01-15 Thread Sean Owen
I haven't dug into this, but I agree that something that is transient isn't meant to be restored by the default Java serialization mechanism. I'd expect the class handles restoring that value as needed or in a custom readObject method. And then I don't know how Kryo interacts with that. I don't

RE: NPE when using Joda DateTime

2016-01-15 Thread Spencer, Alex (Santander)
I'll try the hackier way for now - given the limitation of not being able to modify the environment we've been given. Thanks all for your help so far. Kind Regards, Alex.   -Original Message- From: Sean Owen [mailto:so...@cloudera.com] Sent: 15 January 2016 12:17 To: Spencer, Alex

Re: Using JDBC clients with "Spark on Hive"

2016-01-15 Thread Daniel Darabos
Does Hive JDBC work if you are not using Spark as a backend? I just had very bad experience with Hive JDBC in general. E.g. half the JDBC protocol is not implemented (https://issues.apache.org/jira/browse/HIVE-3175, filed in 2012). On Fri, Jan 15, 2016 at 2:15 AM, sdevashis

spark source Intellij

2016-01-15 Thread Sanjeev Verma
I want to configure the Spark source code in IntelliJ IDEA. Are there any documents available / known steps which can guide me to configure the Spark project in IntelliJ IDEA? Any help will be appreciated. Thanks

Re: DataFrame partitionBy to a single Parquet file (per partition)

2016-01-15 Thread Arkadiusz Bicz
Why does it need to be only one file? Spark does a good job writing to many files. On Fri, Jan 15, 2016 at 7:48 AM, Patrick McGloin wrote: > Hi, > > I would like to repartition / coalesce my data so that it is saved into one > Parquet file per partition. I would also like

sqlContext.cacheTable("tableName") vs dataFrame.cache()

2016-01-15 Thread George Sigletos
According to the documentation they are exactly the same, but in my queries dataFrame.cache() results in much faster execution times vs doing sqlContext.cacheTable("tableName"). Is there any explanation for this? I am not caching the RDD prior to creating the DataFrame. Using PySpark on Spark

Re: Spark App -Yarn-Cluster-Mode ===> Hadoop_conf_**.zip file.

2016-01-15 Thread Ted Yu
bq. check application tracking page:http://slave1:8088/proxy/application_1452763526769_0011/ Then , ... Have you done the above to see what error was in each attempt? Which Spark / Hadoop release are you using? Thanks On Fri, Jan

jobs much slower in cluster mode vs local

2016-01-15 Thread Saif.A.Ellafi
Hello, In general, I am usually able to run spark submit jobs in local mode, on a 32-core node with plenty of RAM. The performance is significantly faster in local mode than when using a cluster of Spark workers. How can this be explained and what measures can one take in order to

Re: jobs much slower in cluster mode vs local

2016-01-15 Thread Jiří Syrový
Hi, you can try to use spark job server and submit jobs to it. The thing is that the most expensive part is context creation. J. 2016-01-15 15:28 GMT+01:00 : > Hello, > > In general, I am usually able to run spark submit jobs in local mode, in a > 32-cores node

Spark App -Yarn-Cluster-Mode ===> Hadoop_conf_**.zip file.

2016-01-15 Thread Siddharth Ubale
Hi, I am trying to run a Spark Streaming application in yarn-cluster mode. However, I am facing an issue where the job ends asking for a particular Hadoop_conf_**.zip file in an HDFS location. Can anyone guide me with this? The application works fine in local mode, only it stops abruptly for want of

Stacking transformations and using intermediate results in the next transformation

2016-01-15 Thread Richard Siebeling
Hi, we're stacking multiple RDD operations on each other; for example, as a source we have an RDD[List[String]] like ["a", "b, c", "d"] ["a", "d, a", "d"] In the first step we split the second column into two columns, in the next step we filter the data on column 3 = "c", and in the final step we're
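A small sketch of that stacking on the two example rows (exact column handling assumed from the description):

  val source = sc.parallelize(Seq(
    List("a", "b, c", "d"),
    List("a", "d, a", "d")))

  // Step 1: split the second column into two columns.
  val split = source.map { row =>
    val Array(x, y) = row(1).split(",").map(_.trim)
    List(row(0), x, y, row(2))
  }

  // Step 2: filter on the (new) third column == "c".
  val filtered = split.filter(_(2) == "c")
  filtered.collect()                                  // Array(List(a, b, c, d))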

Re: Serialization stack error

2016-01-15 Thread Ted Yu
Here is the signature for Get: public class Get extends Query implements Row, Comparable<Row> { It is not Serializable. FYI On Fri, Jan 15, 2016 at 6:37 AM, beeshma r wrote: > HI i am trying to get data from Solr server . > > This is my code > > /*input is JavaRDD li >

RE: jobs much slower in cluster mode vs local

2016-01-15 Thread Spencer, Alex (Santander)
That's not that much of a difference given the overhead of cluster management. I would have thought a job should take minutes before you'll see a performance improvement on using cluster mode? Kind Regards, Alex. From: saif.a.ell...@wellsfargo.com [mailto:saif.a.ell...@wellsfargo.com] Sent: 15

simultaneous actions

2016-01-15 Thread Kira
Hi, Can we run *simultaneous* actions on the *same RDD*? If yes, how can this be done? Thank you, Regards -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/simultaneous-actions-tp25977.html Sent from the Apache Spark User List mailing list archive at

Serialization stack error

2016-01-15 Thread beeshma r
Hi, I am trying to get data from a Solr server. This is my code /*input is JavaRDD<SolrDocumentList> li *output is JavaRDD<Get> for scanning HBase*/ public static JavaRDD<Get> getdocs(JavaRDD<SolrDocumentList> li) { JavaRDD<SolrDocumentList> newdocs=li; JavaRDD<Get> n=newdocs.map(new Function<SolrDocumentList, Get>(){ public Get

Re: Spark App -Yarn-Cluster-Mode ===> Hadoop_conf_**.zip file.

2016-01-15 Thread Ted Yu
Interesting. Which HBase / Phoenix releases are you using? The following method has been removed from Put: public Put setWriteToWAL(boolean write) { Please make sure the Phoenix release is compatible with your HBase version. Cheers On Fri, Jan 15, 2016 at 6:20 AM, Siddharth Ubale <

Re: Serialization stack error

2016-01-15 Thread Ted Yu
Can you encapsulate your map function such that it returns a data type other than Get? You can perform the query against HBase, but don't return Get. Cheers On Fri, Jan 15, 2016 at 6:46 AM, beeshma r wrote: > Hi Ted , > > Any suggestions for changing this piece of code? > > public
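A hedged Scala sketch of that idea against the HBase 1.x client API: build and execute the Get inside the function and return something serializable (here the decoded cell value), so no Get object ever needs to be shipped. The input `rowKeys: RDD[String]`, table, family and qualifier names are placeholders:

  import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
  import org.apache.hadoop.hbase.client.{ConnectionFactory, Get}
  import org.apache.hadoop.hbase.util.Bytes

  val values = rowKeys.mapPartitions { keys =>
    val conn  = ConnectionFactory.createConnection(HBaseConfiguration.create())
    val table = conn.getTable(TableName.valueOf("mytable"))
    val out = keys.map { k =>
      val result = table.get(new Get(Bytes.toBytes(k)))
      Bytes.toString(result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("col")))
    }.toList                                          // materialize before closing
    table.close(); conn.close()
    out.iterator
  }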

Re: simultaneous actions

2016-01-15 Thread Jonathan Coveney
Threads On Friday, January 15, 2016, Kira wrote: > Hi, > > Can we run *simultaneous* actions on the *same RDD* ?; if yes how can this > be done ? > > Thank you, > Regards > > > > -- > View this message in context: >

RE: NPE when using Joda DateTime

2016-01-15 Thread Spencer, Alex (Santander)
OK, this isn’t very efficient, but I’ve found a solution: I am still passing around Joda DateTimes – however, if I ever want to use any of the minusHours/minusDays etc. “transient” variables, I simply do this: var tempDate = new org.joda.time.DateTime(transaction.date.toInstant) then:

RE: Spark App -Yarn-Cluster-Mode ===> Hadoop_conf_**.zip file.

2016-01-15 Thread Siddharth Ubale
Hi, This is the log from the application : 16/01/15 19:23:19 INFO yarn.ApplicationMaster: Unregistering ApplicationMaster with SUCCEEDED (diag message: Shutdown hook called before final status was reported.) 16/01/15 19:23:19 INFO yarn.ApplicationMaster: Deleting staging directory

Re: Serialization stack error

2016-01-15 Thread beeshma r
Hi Ted, Any suggestions for changing this piece of code? public static JavaRDD<Get> getdocs(JavaRDD<SolrDocumentList> li) { JavaRDD<SolrDocumentList> newdocs=li; JavaRDD<Get> n=newdocs.map(new Function<SolrDocumentList, Get>(){ public Get call(SolrDocumentList si) throws IOException

Re: SparkContext SyntaxError: invalid syntax

2016-01-15 Thread Andrew Weiner
Indeed! Here is the output when I run in cluster mode: Traceback (most recent call last): File "pi.py", line 22, in ? raise RuntimeError("\n"+str(sys.version_info) +"\n"+ RuntimeError: (2, 4, 3, 'final', 0) [('PYSPARK_GATEWAY_PORT', '48079'), ('PYTHONPATH',

Spark Streaming: routing by key without groupByKey

2016-01-15 Thread Lin Zhao
I have a requirement to route a paired DStream to a series of map and flatMap operations such that entries with the same key go to the same thread within the same batch. The closest I can come up with is groupByKey().flatMap(_._2). But this cuts throughput by 50%. When I think about it, groupByKey is more
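One alternative worth sketching (an assumption on my part, not something proposed in the thread): hash-partition each batch so records sharing a key land in the same partition, then run the per-key logic with mapPartitions, which gives one task (thread) per partition per batch without materializing per-key groups. `handle` is a placeholder for the map/flatMap logic:

  import org.apache.spark.HashPartitioner

  val routed = pairStream
    .transform(_.partitionBy(new HashPartitioner(32)))
    .mapPartitions { iter =>
      // Records with the same key always hash to the same partition within a batch.
      iter.map { case (key, value) => handle(key, value) }
    }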

Re: DataFrame partitionBy to a single Parquet file (per partition)

2016-01-15 Thread Cheng Lian
You may try DataFrame.repartition(partitionExprs: Column*) to shuffle all data belonging to a single (data) partition into a single (RDD) partition: df.coalesce(1).repartition("entity", "year", "month", "day", "status").write.partitionBy("entity", "year", "month", "day",
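Written out in full against the 1.6 API (the $ column syntax needs import sqlContext.implicits._; the output path is a placeholder):

  df.repartition($"entity", $"year", $"month", $"day", $"status")
    .write
    .partitionBy("entity", "year", "month", "day", "status")
    .parquet("/path/to/output")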

RE: jobs much slower in cluster mode vs local

2016-01-15 Thread Saif.A.Ellafi
Thank you, this looks useful indeed for what I have in mind. Saif From: Jiří Syrový [mailto:syrovy.j...@gmail.com] Sent: Friday, January 15, 2016 12:06 PM To: Ellafi, Saif A. Cc: user@spark.apache.org Subject: Re: jobs much slower in cluster mode vs local Hi, you can try to use spark job

Re: sqlContext.cacheTable("tableName") vs dataFrame.cache()

2016-01-15 Thread Kevin Mellott
Hi George, I believe that sqlContext.cacheTable("tableName") is to be used when you want to cache the data that is being used within a Spark SQL query. For example, take a look at the code below. > val myData = sqlContext.load("com.databricks.spark.csv", Map("path" -> > "hdfs://somepath/file",
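Side by side, the two call sites look roughly like this (table and column names assumed); both should end up marking the same underlying plan for caching:

  // Cache through the SQL catalog, for data used in SQL queries.
  myData.registerTempTable("tableName")
  sqlContext.cacheTable("tableName")
  sqlContext.sql("SELECT col, COUNT(*) FROM tableName GROUP BY col").show()

  // Cache the DataFrame handle directly.
  val cached = myData.cache()
  cached.groupBy("col").count().show()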

Feature importance for RandomForestRegressor in Spark 1.5

2016-01-15 Thread Scott Imig
Hello, I have a couple of quick questions about this pull request, which adds feature importance calculations to the random forests in MLLib. https://github.com/apache/spark/pull/7838 1. Can someone help me determine the Spark version where this is first available? (1.5.0? 1.5.1?) 2.

Re: Feature importance for RandomForestRegressor in Spark 1.5

2016-01-15 Thread Robin East
Re 1. The pull request references the JIRA ticket, in this case https://issues.apache.org/jira/browse/SPARK-5133. The JIRA says it was released in 1.5. --- Robin East
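For reference, once fitted, the spark.ml random forest models expose the importances directly; a brief sketch, assuming a DataFrame `training` with "features" and "label" columns:

  import org.apache.spark.ml.regression.RandomForestRegressor

  val rf = new RandomForestRegressor()
    .setLabelCol("label")
    .setFeaturesCol("features")
    .setNumTrees(50)

  val model = rf.fit(training)
  println(model.featureImportances)                   // one weight per feature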

Re: simultaneous actions

2016-01-15 Thread Sean Owen
Can you run N jobs depending on the same RDD in parallel on the driver? Certainly. The context / scheduling is thread-safe and the RDD is immutable. I've done this to, for example, build and evaluate a bunch of models simultaneously on a big cluster. On Fri, Jan 15, 2016 at 7:10 PM, Jakob Odersky
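A minimal sketch of submitting two such jobs concurrently from the driver (the dataset and actions are placeholders):

  import scala.concurrent.{Await, Future}
  import scala.concurrent.ExecutionContext.Implicits.global
  import scala.concurrent.duration._

  val data = sc.parallelize(1 to 1000000).cache()

  // Two actions against the same cached RDD, run from separate threads.
  val countF = Future { data.count() }
  val sumF   = Future { data.map(_.toLong).sum() }

  println(Await.result(countF, 10.minutes))
  println(Await.result(sumF, 10.minutes))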

Re: DataFrame partitionBy to a single Parquet file (per partition)

2016-01-15 Thread Patrick McGloin
I will try this on Monday. Thanks for the tip. On Fri, 15 Jan 2016, 18:58 Cheng Lian wrote: > You may try DataFrame.repartition(partitionExprs: Column*) to shuffle all > data belonging to a single (data) partition into a single (RDD) partition: > >

Re: Spark Streaming + Kafka + scala job message read issue

2016-01-15 Thread vivek.meghanathan
All, The issue was related to Apache Cassandra. I have changed from Apache Cassandra to DataStax Cassandra and the issue is resolved. I have also changed the Spark version to 1.3. There is some serious issue between the Spark Cassandra connector and Apache Cassandra 2.1+ when used with Spark

Re: simultaneous actions

2016-01-15 Thread Jakob Odersky
I don't think RDDs are threadsafe. More fundamentally however, why would you want to run RDD actions in parallel? The idea behind RDDs is to provide you with an abstraction for computing parallel operations on distributed data. Even if you were to call actions from several threads at once, the

Re: simultaneous actions

2016-01-15 Thread Jonathan Coveney
SparkContext is thread-safe. And RDDs just describe operations. While I generally agree that you want to model as much as possible as transformations, this is not always possible. And in that case, you have no option other than to use threads. Spark's designers should have made all actions

Re: SparkContext SyntaxError: invalid syntax

2016-01-15 Thread Andrew Weiner
I tried playing around with my environment variables, and here is an update. When I run in cluster mode, my environment variables do not persist throughout the entire job. For example, I tried creating a local copy of HADOOP_CONF_DIR in /home//local/etc/hadoop/conf, and then, in spark-env.sh I

RE: Is it possible to use SparkSQL JDBC ThriftServer without Hive

2016-01-15 Thread Sambit Tripathy (RBEI/EDS1)
Hi Mohammed, I think this is something you can do at the Thrift server startup. So this would run an instance of Derby and act as a metastore. Any idea if this Derby metastore will have distributed access, and why do we use the Hive metastore then? @Angela: I would also be happy to have a

Re: simultaneous actions

2016-01-15 Thread Matei Zaharia
RDDs actually are thread-safe, and quite a few applications use them this way, e.g. the JDBC server. Matei > On Jan 15, 2016, at 2:10 PM, Jakob Odersky wrote: > > I don't think RDDs are threadsafe. > More fundamentally however, why would you want to run RDD actions in >

Re: SparkContext SyntaxError: invalid syntax

2016-01-15 Thread Andrew Weiner
Actually, I just found this [ https://issues.apache.org/jira/browse/SPARK-1680], which after a bit of googling and reading leads me to believe that the preferred way to change the yarn environment is to edit the spark-defaults.conf file by adding this line: spark.yarn.appMasterEnv.PYSPARK_PYTHON
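For example, with the interpreter path from the working setup described later in this thread (substitute your own path):

  # conf/spark-defaults.conf
  spark.yarn.appMasterEnv.PYSPARK_PYTHON  /home/aqualab/local/bin/python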

Re: simultaneous actions

2016-01-15 Thread Koert Kuipers
We run multiple actions on the same (cached) RDD all the time, I guess in different threads indeed (it's in Akka). On Fri, Jan 15, 2016 at 2:40 PM, Matei Zaharia wrote: > RDDs actually are thread-safe, and quite a few applications use them this > way, e.g. the JDBC

Re: SparkContext SyntaxError: invalid syntax

2016-01-15 Thread Andrew Weiner
I finally got the pi.py example to run in yarn cluster mode. This was the key insight: https://issues.apache.org/jira/browse/SPARK-9229 I had to set SPARK_YARN_USER_ENV in spark-env.sh: export SPARK_YARN_USER_ENV="PYSPARK_PYTHON=/home/aqualab/local/bin/python" This caused the PYSPARK_PYTHON

Re: simultaneous actions

2016-01-15 Thread Jakob Odersky
I stand corrected. How considerable are the benefits though? Will the scheduler be able to dispatch jobs from both actions simultaneously (or on a when-workers-become-available basis)? On 15 January 2016 at 11:44, Koert Kuipers wrote: > we run multiple actions on the same

Re: simultaneous actions

2016-01-15 Thread Sean Owen
It makes sense if you're parallelizing jobs that have relatively few tasks, and have a lot of execution slots available. It makes sense to turn them loose all at once and try to use the parallelism available. There are downsides, eventually: for example, N jobs accessing one cached RDD may

Re: DataFrameWriter on partitionBy for parquet eat all RAM

2016-01-15 Thread Michael Armbrust
See here for some workarounds: https://issues.apache.org/jira/browse/SPARK-12546 On Thu, Jan 14, 2016 at 6:46 PM, Jerry Lam wrote: > Hi Arkadiusz, > > the partitionBy is not designed to have many distinct value the last time > I used it. If you search in the mailing list,

has any one implemented TF_IDF using ML transformers?

2016-01-15 Thread Andy Davidson
I wonder if I am missing something? TF-IDF is very popular. Spark ML has a lot of transformers; however, TF-IDF is not supported directly. Spark provides HashingTF and IDF transformers. The doc http://spark.apache.org/docs/latest/mllib-feature-extraction.html#tf-idf mentions you can
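The usual composition is to chain the two transformers (plus a tokenizer) in a pipeline; a sketch with assumed column names, for a DataFrame `docs` with a string column "text":

  import org.apache.spark.ml.Pipeline
  import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}

  val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
  val tf        = new HashingTF().setInputCol("words").setOutputCol("rawFeatures")
  val idf       = new IDF().setInputCol("rawFeatures").setOutputCol("features")

  val tfidf = new Pipeline()
    .setStages(Array(tokenizer, tf, idf))
    .fit(docs)
    .transform(docs)

  tfidf.select("features").show(5)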

Re: [Spark 1.6][Streaming] About the behavior of mapWithState

2016-01-15 Thread Shixiong(Ryan) Zhu
Hey Terry, That's expected. If you want to only output (1, 3), you can use "reduceByKey" before "mapWithState" like this: dstream.reduceByKey(_ + _).mapWithState(spec) On Fri, Jan 15, 2016 at 1:21 AM, Terry Hoo wrote: > Hi, > I am doing a simple test with mapWithState,
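Spelled out a little, with a simple per-key sum rather than the original poster's exact function, and assuming a DStream[(Int, Int)]:

  import org.apache.spark.streaming.{State, StateSpec}

  val summing = (key: Int, value: Option[Int], state: State[Int]) => {
    val total = state.getOption().getOrElse(0) + value.getOrElse(0)
    state.update(total)
    (key, total)
  }

  // Collapse each batch per key first, then feed one value per key into the state.
  val result = dstream
    .reduceByKey(_ + _)
    .mapWithState(StateSpec.function(summing))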

Re: Reply: Reply: Reply: spark streaming context trigger invoke stop why?

2016-01-15 Thread Shixiong(Ryan) Zhu
I see. So when your job fails, `jsc.awaitTermination();` will throw an exception. Then your app's main method will exit and trigger the shutdown hook and call `jsc.stop()`. On Thu, Jan 14, 2016 at 10:20 PM, Triones,Deng(vip.com) < triones.d...@vipshop.com> wrote: > Thanks for your response . > >

Re: DataFrameWriter on partitionBy for parquet eat all RAM

2016-01-15 Thread Jerry Lam
Hi Michael, Thanks for sharing the tip. It will help the write path of the partitioned table. Do you have a similar suggestion for reading the partitioned table back when there are a million distinct values in the partition field (for example, user id)? Last time I had trouble reading a

Re: How To Save TF-IDF Model In PySpark

2016-01-15 Thread Andy Davidson
Are you using 1.6.0 or an older version? I think I remember something in 1.5.1 saying save was not implemented in python. The current doc does not say anything about save() http://spark.apache.org/docs/latest/mllib-feature-extraction.html#tf-idf

Re: How To Save TF-IDF Model In PySpark

2016-01-15 Thread Jerry Lam
Can you save it to parquet with the vector in one field? Sent from my iPhone > On 15 Jan, 2016, at 7:33 pm, Andy Davidson > wrote: > > Are you using 1.6.0 or an older version? > > I think I remember something in 1.5.1 saying save was not implemented in >

Spark streaming: Fixed time aggregation & handling driver failures

2016-01-15 Thread ffarozan
I am implementing aggregation using Spark Streaming and Kafka. My batch and window sizes are the same, and the aggregated data is persisted in Cassandra. I want to aggregate for fixed time windows - 5:00, 5:05, 5:10, ... But we cannot control when the streaming job runs; we only get to specify the

Re: Sending large objects to specific RDDs

2016-01-15 Thread Ted Yu
My knowledge of XSEDE is limited - I visited the website. If there is no easy way to deploy HBase, an alternative approach (using HDFS?) needs to be considered. I need to do more homework on this :-) On Thu, Jan 14, 2016 at 3:51 PM, Daniel Imberman wrote: > Hi Ted, >

RE: Is it possible to use SparkSQL JDBC ThriftServer without Hive

2016-01-15 Thread Mohammed Guller
Sambit - I believe the default Derby-based metastore allows only one active user at a time. You can replace it with MySQL or Postgres. Using the Hive metastore enables Spark SQL to be compatible with Hive. If you have an existing Hive setup, you can use Spark SQL to process data in your Hive

Consuming commands from a queue

2016-01-15 Thread Afshartous, Nick
Hi, We have a streaming job that consumes from Kafka and outputs to S3. We're going to have the job also send commands (to copy from S3 to Redshift) into a different Kafka topic. What would be the best framework for consuming and processing the copy commands? We're considering creating

Re: Compiling only MLlib?

2016-01-15 Thread Ted Yu
Looks like you didn't have zinc running. Take a look at install_zinc() in build/mvn, around line 83. You can use build/mvn instead of running mvn directly. I normally use the following command line: build/mvn clean -Phive -Phive-thriftserver -Pyarn -Phadoop-2.4 -Dhadoop.version=2.7.0 package

Re: Compiling only MLlib?

2016-01-15 Thread Matei Zaharia
Have you tried just downloading a pre-built package, or linking to Spark through Maven? You don't need to build it unless you are changing code inside it. Check out http://spark.apache.org/docs/latest/quick-start.html#self-contained-applications for how to link to it. Matei > On Jan 15,

Executor initialize before all resources are ready

2016-01-15 Thread Byron Wang
Hi, I am building a metrics system for a Spark Streaming job. In the system, the metrics are collected in each executor, so a metrics source (a class used to collect metrics) needs to be initialized in each executor. The metrics source is packaged in a jar; when submitting a job, the jar is sent from

Re: Consuming commands from a queue

2016-01-15 Thread Cody Koeninger
Reading commands from Kafka and triggering a Redshift copy is sufficiently simple that it could just be a bash script. But if you've already got a Spark Streaming job set up, you may as well use it for consistency's sake. There's definitely no need to mess around with Akka. On Fri, Jan 15, 2016 at 6:25

Re: Spark streaming: Fixed time aggregation & handling driver failures

2016-01-15 Thread Cody Koeninger
You can't really use spark batches as the basis for any kind of reliable time aggregation. Time of batch processing in general has nothing to do with time of event. You need to filter / aggregate by the time interval you care about, in your own code, or use a data store that can do the
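A sketch of that approach: key each record by its own event-time bucket so the 5-minute boundaries (5:00, 5:05, ...) come from the data rather than from when a batch happens to run. The field layout is assumed:

  // events: DStream[(String, (Long, Double))] as (key, (eventTimeMillis, value))
  val windowMs = 5 * 60 * 1000L

  val perBucket = events.map { case (key, (ts, value)) =>
    val bucketStart = (ts / windowMs) * windowMs      // floor to 5:00, 5:05, 5:10, ...
    ((key, bucketStart), value)
  }.reduceByKey(_ + _)

  // Upserting by (key, bucketStart) into Cassandra keeps replays idempotent.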

Re: Multi tenancy, REST and MLlib

2016-01-15 Thread Kevin Mellott
It sounds like you may be interested in a solution that implements the Lambda Architecture, such as Oryx2. At a high level, this gives you the ability to request and receive information immediately (serving layer), generating

Re: SparkContext SyntaxError: invalid syntax

2016-01-15 Thread Bryan Cutler
Glad you got it going! It wasn't very obvious what needed to be set; maybe it is worth explicitly stating this in the docs since it seems to have come up a couple of times before too. Bryan On Fri, Jan 15, 2016 at 12:33 PM, Andrew Weiner < andrewweiner2...@u.northwestern.edu> wrote: > Actually,

Multi tenancy, REST and MLlib

2016-01-15 Thread feribg
I'm fairly new to Spark and MLlib, but I'm doing some research into multi-tenancy for an MLlib-based app. The idea is to provide the ability to train models on demand with certain constraints (executor size) and then to serve predictions from those models via a REST layer. So far from my research

How To Save TF-IDF Model In PySpark

2016-01-15 Thread Asim Jalis
Hi, I am trying to save a TF-IDF model in PySpark. Looks like this is not supported. Using `model.save()` causes: AttributeError: 'IDFModel' object has no attribute 'save' Using `pickle` causes: TypeError: can't pickle lock objects Does anyone have suggestions? Thanks! Asim Here is the

Compiling only MLlib?

2016-01-15 Thread Colin Woodbury
Hi, I'm very much interested in using Spark's MLlib in standalone programs. I've never used Hadoop, and don't intend to deploy on massive clusters. Building Spark has been an honest nightmare, and I've been on and off it for weeks. The build always runs out of RAM on my laptop (4g of RAM, Arch

spark.master overwritten in standalone plus cluster deploy-mode

2016-01-15 Thread shanson
Issue: 'spark.master' not updated in driver launch command when mastership changes. Settings: --deploy-mode cluster \ --supervise \ --master "spark://master1:6066,master2:6066,master3:6066" \ --conf "spark.master=spark://master1:7077,master2:7077,master3:7077" ... java opts and

Re: Executor initialize before all resources are ready

2016-01-15 Thread Ted Yu
Which Spark release are you using? Thanks On Fri, Jan 15, 2016 at 7:08 PM, Byron Wang wrote: > Hi, I am building metrics system for Spark Streaming job, in the system, > the > metrics are collected in each executor, so a metrics source (a class used > to > collect metrics)

AIC in Linear Regression in ml pipeline

2016-01-15 Thread Arunkumar Pillai
Hi, Is it possible to get the AIC value in Linear Regression using the ml pipeline? If so, please help me. -- Thanks and Regards Arun

[Spark 1.6][Streaming] About the behavior of mapWithState

2016-01-15 Thread Terry Hoo
Hi, I am doing a simple test with mapWithState, and get some unexpected events; is this correct? The test is very simple: sum the values for each key. val mappingFunction = (key: Int, value: Option[Int], state: State[Int]) => { state.update(state.getOption().getOrElse(0) + value.getOrElse(0))

RE: NPE when using Joda DateTime

2016-01-15 Thread Spencer, Alex (Santander)
Hi, I tried Zhu’s recommendation and sadly got the same error. (Again, a single map worked, but the groupBy / flatMap generates this error.) Does Kryo have a bug, i.e. it’s not serialising all the components needed, or do I just need to get our IT team to install those magro Serializers as suggested by
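If the serializers can eventually be installed, registration would look roughly like the sketch below. This assumes the de.javakaffee kryo-serializers artifact (which provides a JodaDateTimeSerializer) is on the classpath; it has not been verified against this environment:

  import com.esotericsoftware.kryo.Kryo
  import de.javakaffee.kryoserializers.jodatime.JodaDateTimeSerializer
  import org.apache.spark.serializer.KryoRegistrator

  class JodaRegistrator extends KryoRegistrator {
    override def registerClasses(kryo: Kryo): Unit = {
      kryo.register(classOf[org.joda.time.DateTime], new JodaDateTimeSerializer())
    }
  }

  // SparkConf:
  //   .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  //   .set("spark.kryo.registrator", "your.package.JodaRegistrator")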