Re: How to implement an Evaluator for a ML pipeline?

2015-05-20 Thread Stefan H.
Thanks, Xiangrui, for clarifying the metric and creating that JIRA issue. I made an error while composing my earlier mail: paramMap.get(als.regParam) in my Evaluator actually returns None. I just happened to use getOrElse(1.0) in my tests, which explains why negating the metric did not

Re: Spark Streaming to Kafka

2015-05-20 Thread twinkle sachdeva
Thanks Saisai. On Wed, May 20, 2015 at 11:23 AM, Saisai Shao sai.sai.s...@gmail.com wrote: I think here is the PR https://github.com/apache/spark/pull/2994 you could refer to. 2015-05-20 13:41 GMT+08:00 twinkle sachdeva twinkle.sachd...@gmail.com: Hi, As Spark streaming is being nicely

Re: spark streaming doubt

2015-05-20 Thread Akhil Das
One receiver basically runs on 1 core, so if your single node has 4 cores, there are still 3 cores left for the processing (for executors). And yes, the receiver remains on the same machine unless some failure happens. Thanks Best Regards On Tue, May 19, 2015 at 10:57 PM, Shushant Arora
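For illustration, a minimal PySpark sketch of this layout, assuming a 4-core node and the socket source purely as a placeholder: the single receiver pins one core, the remaining three serve the processing tasks.

    from pyspark import SparkConf, SparkContext
    from pyspark.streaming import StreamingContext

    # local[4]: the one receiver pins 1 core, 3 cores remain for executors
    conf = SparkConf().setMaster("local[4]").setAppName("receiver-cores")
    ssc = StreamingContext(SparkContext(conf=conf), 2)  # 2-second batches

    lines = ssc.socketTextStream("localhost", 9999)  # placeholder source
    lines.count().pprint()

    ssc.start()
    ssc.awaitTermination()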

Re: Mesos Spark Tasks - Lost

2015-05-20 Thread Tim Chen
Can you share your exact spark-submit command line? Also, cluster mode is not released yet (1.4) and doesn't support spark-shell, so I think you're just using client mode unless you're using the latest master. Tim On Tue, May 19, 2015 at 8:57 AM, Panagiotis Garefalakis panga...@gmail.com

Re: Hive on Spark VS Spark SQL

2015-05-20 Thread Sean Owen
I don't think that's quite the difference. Any SQL engine has a query planner and an execution engine. Both of these use Spark for execution. HoS uses Hive for query planning. Although it's not optimized for execution on Spark per se, it's got a lot of language support and is stable/mature. Spark

Is this a good use case for Spark?

2015-05-20 Thread jakeheller
Hi all, I'm new to Spark -- so new that we're deciding whether to use it in the first place, and I was hoping someone here could help me figure that out. We're doing a lot of processing of legal documents -- in particular, the entire corpus of American law. It's about 10m documents, many of

Re: spark 1.3.1 jars in repo1.maven.org

2015-05-20 Thread Sean Owen
Yes, the published artifacts can only refer to one version of anything (OK, modulo publishing a large number of variants under classifiers). You aren't intended to rely on Spark's transitive dependencies for anything. Compiling against the Spark API has no relation to what version of Hadoop it

Re: Multi user setup and saving a DataFrame / RDD to a network exported file system

2015-05-20 Thread Tomasz Fruboes
Hi, thanks for the answer. The rights are drwxr-xr-x 3 tfruboes all 5632 05-19 15:40 test19EE/ I have tried setting the rights to 777 for this directory prior to execution. This does not get propagated down the chain, i.e. the directory created as a result of the save call (namesAndAges.parquet2

java program Get Stuck at broadcasting

2015-05-20 Thread allanjie
Hi All, The variable I need to broadcast is just 468 MB. When broadcasting, it just "stops" here: 15/05/20 11:36:14 INFO Configuration.deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id 15/05/20 11:36:14 INFO Configuration.deprecation: mapred.task.id is deprecated.

Re: Hive on Spark VS Spark SQL

2015-05-20 Thread Debasish Das
SparkSQL was built to improve upon the Hive on Spark runtime further... On Tue, May 19, 2015 at 10:37 PM, guoqing0...@yahoo.com.hk guoqing0...@yahoo.com.hk wrote: Hive on Spark and SparkSQL, which should be better, and what are the key characteristics and the advantages and the disadvantages

Re: Spark and RabbitMQ

2015-05-20 Thread Abel Rincón
Hi, There is a RabbitMQ receiver for spark-streaming http://search.maven.org/#artifactdetails|com.stratio.receiver|rabbitmq|0.1.0-RELEASE|jar https://github.com/Stratio/RabbitMQ-Receiver 2015-05-12 14:49 GMT+02:00 Dmitry Goldenberg dgoldenberg...@gmail.com: Thanks, Akhil. It looks like in

Re: Reading Binary files in Spark program

2015-05-20 Thread Akhil Das
If you can share the complete code and a sample file, maybe I can try to reproduce it on my end. Thanks Best Regards On Wed, May 20, 2015 at 7:00 AM, Tapan Sharma tapan.sha...@gmail.com wrote: Problem is still there. Exception is not coming at the time of reading. Also the count of

Re: Multi user setup and saving a DataFrame / RDD to a network exported file system

2015-05-20 Thread Iulian Dragoș
You could try setting `SPARK_USER` to the user under which your workers are running. I couldn't find many references to this variable, but at least Yarn and Mesos take it into account when spawning executors. Chances are that standalone mode also does it. iulian On Wed, May 20, 2015 at 9:29 AM,
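A small sketch of that experiment in PySpark, assuming the variable is picked up when executors are spawned (the user name is taken from this thread):

    import os

    # assumption: SPARK_USER must be exported before the context starts
    os.environ["SPARK_USER"] = "tfruboes"

    from pyspark import SparkContext
    sc = SparkContext(appName="spark-user-test")
    print(sc.sparkUser())  # verify which user Spark now reports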

RE: Intermittent difficulties for Worker to contact Master on same machine in standalone

2015-05-20 Thread Evo Eftimov
Check whether the name can be resolved in the /etc/hosts file (or DNS) of the worker (the same btw applies for the Node where you run the driver app – all other nodes must be able to resolve its name) From: Stephen Boesch [mailto:java...@gmail.com] Sent: Wednesday, May 20, 2015 10:07

How to set HBaseConfiguration in Spark

2015-05-20 Thread donhoff_h
Hi all, I wrote a program to get an HBaseConfiguration object in Spark. But after I printed the content of this hbase-conf object, I found the values were wrong. For example, the property hbase.zookeeper.quorum should be bgdt01.dev.hrb,bgdt02.dev.hrb,bgdt03.hrb. But the printed value is localhost.

PySpark Logs location

2015-05-20 Thread Oleg Ruchovets
Hi, I am executing a PySpark job on yarn (Hortonworks distribution). Could someone point me to where the log locations are? Thanks Oleg.

Spark Streaming - Design considerations/Knobs

2015-05-20 Thread Hemant Bhanawat
Hi, I have compiled a list (from online sources) of knobs/design considerations that need to be taken care of by applications running on spark streaming. Is my understanding correct? Any other important design consideration that I should take care of? - A DStream is associated with a single

Re: Hive on Spark VS Spark SQL

2015-05-20 Thread ayan guha
And if I am not wrong, the Spark SQL API is intended to move closer to SQL standards. I feel it's a clever decision on Spark's part to keep both APIs operational. These short-term confusions are worth the long-term benefits. On 20 May 2015 17:19, Sean Owen so...@cloudera.com wrote: I don't think that's

Intermittent difficulties for Worker to contact Master on same machine in standalone

2015-05-20 Thread Stephen Boesch
What conditions would cause the following delays / failure for a standalone machine/cluster to have the Worker contact the Master? 15/05/20 02:02:53 INFO WorkerWebUI: Started WorkerWebUI at http://10.0.0.3:8081 15/05/20 02:02:53 INFO Worker: Connecting to master

Re: Multi user setup and saving a DataFrame / RDD to a network exported file system

2015-05-20 Thread Tomasz Fruboes
Thanks for a suggestion. I have tried playing with it; sc.sparkUser() gives me the expected user name, but it doesn't solve the problem. From a quick search through the spark code it seems to me that this setting is effective only for yarn and mesos. I think the workaround for the problem could

Re: spark streaming doubt

2015-05-20 Thread Shushant Arora
So I can explicitly specify the number of receivers and executors in receiver-based streaming? Can you share a sample program if any? Also, in low-level non-receiver-based streaming, will data be fetched and processed by the same worker executor node? Also, if I have concurrent jobs set to 1 - so in low-level fetching

saveasorcfile on partitioned orc

2015-05-20 Thread patcharee
Hi, I followed the information on https://www.mail-archive.com/reviews@spark.apache.org/msg141113.html to save orc file with spark 1.2.1. I can save data to a new orc file. I wonder how to save data to an existing and partitioned orc file? Any suggestions? BR, Patcharee

Re: Code error

2015-05-20 Thread Romain Sagean
Hi Ricardo, instead of filtering the header, just remove the header of your file. In your code you create a filter for the header but you don't use it to compute parsedData. val parsedData = filter_data.map(s => Vectors.dense(s.split(',').map(_.toDouble))).cache() 2015-05-19 21:23 GMT+02:00 Stephen
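If editing the file is not an option, a commonly used alternative (a PySpark sketch, not necessarily Romain's approach) drops the first line of the first partition instead:

    from pyspark.mllib.linalg import Vectors

    def drop_header(idx, it):
        # skip the first line of the first partition only
        if idx == 0:
            next(it, None)
        return it

    data = sc.textFile("data.csv")  # placeholder path
    parsedData = (data.mapPartitionsWithIndex(drop_header)
                      .map(lambda s: Vectors.dense([float(x) for x in s.split(",")]))
                      .cache())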

Re: spark streaming doubt

2015-05-20 Thread Akhil Das
On Wed, May 20, 2015 at 1:12 PM, Shushant Arora shushantaror...@gmail.com wrote: So I can explicitly specify no of receivers and executors in receiver based streaming? Can you share a sample program if any? - You can look at the low-level consumer repo

LATERAL VIEW explode issue

2015-05-20 Thread kiran mavatoor
Hi, When I use LATERAL VIEW explode on the registered temp table in spark shell, it works. But when I use the same in spark-submit (as a jar file) it is not working; it's giving the error - failure: ``union'' expected but identifier VIEW found. The SQL statement I am using is SELECT id,mapKey FROM

Re: Reading Binary files in Spark program

2015-05-20 Thread Tapan Sharma
I am not doing anything special. Here is the code:

    SparkConf sparkConf = new SparkConf().setAppName("JavaSequenceFile");
    JavaSparkContext ctx = new JavaSparkContext(sparkConf);
    JavaPairRDD<String, Byte> seqFiles = ctx.sequenceFile(args[0], String.class, Byte.class);
    // Following statements is

RE: LATERAL VIEW explode issue

2015-05-20 Thread yana
Just a guess, but are you using HiveContext in one case vs SqlContext in another? You don't show a stack trace, but this looks like a parser error... which would make me guess a different context or a different Spark version on the cluster you are submitting to... Sent on the new Sprint Network from my

Re: How to run multiple jobs in one sparkcontext from separate threads in pyspark?

2015-05-20 Thread MEETHU MATHEW
Hi Davies, Thank you for pointing to spark streaming. I am confused about how to return the result after running a function via a thread. I tried using Queue to add the results to it and print it at the end. But here, I can see the results only after all threads are finished. How to get the result of

Initial job has not accepted any resources

2015-05-20 Thread podioss
Hi, I am running Spark jobs with the standalone resource manager and I am gathering several performance metrics from my cluster nodes. I am also gathering disk I/O metrics from my nodes, and because many of my jobs use the same dataset I am trying to prevent the operating system from caching the

Re: save column values of DataFrame to text file

2015-05-20 Thread allanjie
Sorry, but how does that work? Can you give more detail about the problem? On 20 May 2015 at 21:32, oubrik [via Apache Spark User List] ml-node+s1001560n2295...@n3.nabble.com wrote: hi, try like this DataFrame df = sqlContext.load("com.databricks.spark.csv", options); df.select("year",

Re: Reading Binary files in Spark program

2015-05-20 Thread Akhil Das
Hi, Basically you need to convert it to a serializable format before doing the collect. You can fire up a spark shell and paste this:

    val sFile = sc.sequenceFile[LongWritable, Text]("/home/akhld/sequence/sigmoid")
      .map(_._2.toString)
    sFile.take(5).foreach(println)

Use the
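For reference, a rough PySpark equivalent, with the path and Writable classes assumed from the Scala snippet (PySpark converts the Writables to Python types itself):

    pairs = sc.sequenceFile("/home/akhld/sequence/sigmoid",
                            "org.apache.hadoop.io.LongWritable",
                            "org.apache.hadoop.io.Text")
    for value in pairs.map(lambda kv: kv[1]).take(5):
        print(value)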

Re: How to use spark to access HBase with Security enabled

2015-05-20 Thread Bill Q
I have a similar problem: I cannot pass the HBase configuration file as an extra classpath to Spark any more using spark.executor.extraClassPath=MY_HBASE_CONF_DIR in Spark 1.3. We used to run this in 1.2 without any problem. On Tuesday, May 19, 2015, donhoff_h 165612...@qq.com wrote: Sorry,

Re: Incrementally add/remove vertices in GraphX

2015-05-20 Thread vzaychik
Any updates on GraphX Streaming? There was mention of this about a year ago, but nothing much since. Thanks! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Incrementally-add-remove-vertices-in-GraphX-tp2227p22963.html Sent from the Apache Spark User List

Re: Wish for 1.4: upper bound on # tasks in Mesos

2015-05-20 Thread Nicholas Chammas
To put this on the devs' radar, I suggest creating a JIRA for it (and checking first if one already exists). issues.apache.org/jira/ Nick On Tue, May 19, 2015 at 1:34 PM Matei Zaharia matei.zaha...@gmail.com wrote: Yeah, this definitely seems useful there. There might also be some ways to

Re: Is this a good use case for Spark?

2015-05-20 Thread Davies Liu
Spark is a great framework to do things in parallel with multiple machines, and will be really helpful for your case. Once you can wrap your entire pipeline into a single Python function:

    def process_document(path, text):
        # you can call other tools or services here
        return xxx

then you can
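The archived message is cut off here; presumably the next step is mapping that function over the corpus, along these lines (the paths are placeholders, not from the original mail):

    # wholeTextFiles yields (path, content) pairs, one per document
    docs = sc.wholeTextFiles("hdfs:///corpus/")  # assumed input location
    results = docs.map(lambda pt: process_document(pt[0], pt[1]))
    results.saveAsTextFile("hdfs:///corpus-out")  # assumed output location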

Re: java program Get Stuck at broadcasting

2015-05-20 Thread Akhil Das
This is more like an issue with your HDFS setup, can you check in the datanode logs? Also try putting a new file in HDFS and see if that works. Thanks Best Regards On Wed, May 20, 2015 at 11:47 AM, allanjie allanmcgr...@gmail.com wrote: ​Hi All, The variable I need to broadcast is just 468

Re: How to run multiple jobs in one sparkcontext from separate threads in pyspark?

2015-05-20 Thread Davies Liu
I think this is a general multi-threading question; Queue is the right direction to go. Have you tried something like this?

    results = Queue.Queue()

    def run_job(f, args):
        r = f(*args)
        results.put(r)

    # start multiple threads to run jobs
    threading.Thread(target=run_job, args=(f,
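A self-contained version of the same idea, with placeholder jobs standing in for the real work:

    import threading
    import Queue  # Python 2, matching the snippet above

    from pyspark import SparkContext

    sc = SparkContext(appName="threaded-jobs")
    rdd = sc.parallelize(range(100))

    results = Queue.Queue()

    def run_job(f, args):
        results.put(f(*args))

    # each thread drives its own Spark job on the shared context
    threads = [threading.Thread(target=run_job, args=(rdd.count, ())),
               threading.Thread(target=run_job, args=(rdd.sum, ()))]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # results become available once the jobs finish
    while not results.empty():
        print(results.get())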

Re: How to set HBaseConfiguration in Spark

2015-05-20 Thread Naveen Madhire
Cloudera blog has some details. Please check if this is helpful to you. http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase/ Thanks. On Wed, May 20, 2015 at 4:21 AM, donhoff_h 165612...@qq.com wrote: Hi, all I wrote a program to get HBaseConfiguration object in Spark.

Re: LATERAL VIEW explode issue

2015-05-20 Thread kiran mavatoor
Hi Yana, I was using sqlContext in the program by creating new SqlContext(sc). This created the problem when I submit the job using spark-submit. Whereas, when I run the same program in spark-shell, the default context is a hive context (it seems) and everything seems to be fine. This
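A minimal sketch of the fix (the table and column names here are made up for illustration):

    from pyspark.sql import HiveContext

    # the HiveContext parser understands LATERAL VIEW; the plain
    # SQLContext parser in Spark 1.x does not, hence the parse error
    sqlContext = HiveContext(sc)
    df = sqlContext.sql(
        "SELECT id, mapKey FROM tbl LATERAL VIEW explode(keys) t AS mapKey")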

Re: Re: spark 1.3.1 jars in repo1.maven.org

2015-05-20 Thread Sean Owen
I don't think any of those problems are related to Hadoop. Have you looked at userClassPathFirst settings? On Wed, May 20, 2015 at 6:46 PM, Edward Sargisson ejsa...@gmail.com wrote: Hi Sean and Ted, Thanks for your replies. I don't have our current problems nicely written up as good

FP Growth saveAsTextFile

2015-05-20 Thread Eric Tanner
I am having trouble with saving an FP-Growth model as a text file. I can print out the results, but when I try to save the model I get a NullPointerException. model.freqItemsets.saveAsTextFile("c://fpGrowth/model") Thanks, Eric

Re: Spark 1.3.1 - SQL Issues

2015-05-20 Thread Davies Liu
The docs have been updated. You should convert the DataFrame to RDD by `df.rdd` On Mon, Apr 20, 2015 at 5:23 AM, ayan guha guha.a...@gmail.com wrote: Hi, Just upgraded to Spark 1.3.1. I am getting a warning Warning (from warnings module): File
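For illustration, the one-line conversion (toy data and invented column names):

    df = sqlContext.createDataFrame([(1, "a"), (2, "b")], ["id", "val"])
    rows = df.rdd  # an RDD of Row objects
    print(rows.map(lambda r: r.id).collect())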

GradientBoostedTrees.trainRegressor with categoricalFeaturesInfo

2015-05-20 Thread Don Drake
I'm running Spark v1.3.1 and when I run the following against my dataset: model = GradientBoostedTrees.trainRegressor(trainingData, categoricalFeaturesInfo=catFeatures, maxDepth=6, numIterations=3) The job will fail with the following message: Traceback (most recent call last): File

Re: Spark Streaming - Design considerations/Knobs

2015-05-20 Thread Tathagata Das
Correcting the ones that are incorrect or incomplete. BUT this is good list for things to remember about Spark Streaming. On Wed, May 20, 2015 at 3:40 AM, Hemant Bhanawat hemant9...@gmail.com wrote: Hi, I have compiled a list (from online sources) of knobs/design considerations that need to

Re: Multi user setup and saving a DataFrame / RDD to a network exported file system

2015-05-20 Thread Davies Liu
Could you file a JIRA for this? The executor should run under the user who submitted the job, I think. On Wed, May 20, 2015 at 2:40 AM, Tomasz Fruboes tomasz.frub...@fuw.edu.pl wrote: Thanks for a suggestion. I have tried playing with it, sc.sparkUser() gives me the expected user name, but it doesn't

Re: PySpark Logs location

2015-05-20 Thread Oleg Ruchovets
Hi Ruslan. Could you add more details, please? Where do I get the applicationId? In case I have a lot of log files, would it make sense to view them from a single point? How actually can I configure / manage the log location of PySpark? Thanks Oleg. On Wed, May 20, 2015 at 10:24 PM, Ruslan Dautkhanov

Re: FP Growth saveAsTextFile

2015-05-20 Thread Xiangrui Meng
Could you post the stack trace? If you are using Spark 1.3 or 1.4, it would be easier to save freq itemsets as a Parquet file. -Xiangrui On Wed, May 20, 2015 at 12:16 PM, Eric Tanner eric.tan...@justenough.com wrote: I am having trouble with saving an FP-Growth model as a text file. I can
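A sketch of the Parquet route, rendered in PySpark 1.4 syntax with an assumed output path (the original poster's code is Scala, so treat this as the shape of the idea rather than a drop-in fix):

    # each frequent itemset exposes .items and .freq
    freq = model.freqItemsets().map(lambda fi: (fi.items, fi.freq))
    df = sqlContext.createDataFrame(freq, ["items", "freq"])
    df.saveAsParquetFile("/fpGrowth/model.parquet")  # assumed path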

Read multiple files from S3

2015-05-20 Thread lovelylavs
Hi, I am trying to get a collection of files according to LastModifiedDate from S3:

    List<String> FileNames = new ArrayList<String>();
    ListObjectsRequest listObjectsRequest = new ListObjectsRequest()
        .withBucketName(s3_bucket)
        .withPrefix(logs_dir);
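For comparison, a hedged boto3 sketch of the same filter feeding the selected keys to Spark (bucket, prefix, and cutoff date are placeholders):

    import datetime
    import boto3

    s3 = boto3.client("s3")
    cutoff = datetime.datetime(2015, 5, 1, tzinfo=datetime.timezone.utc)

    paths = []
    for obj in s3.list_objects(Bucket="s3_bucket", Prefix="logs_dir/").get("Contents", []):
        if obj["LastModified"] >= cutoff:
            paths.append("s3n://s3_bucket/" + obj["Key"])

    # Spark accepts a comma-separated list of individual files
    rdd = sc.textFile(",".join(paths))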

Re: User Defined Type (UDT)

2015-05-20 Thread Xiangrui Meng
Probably in 1.5. I made a JIRA for it: https://issues.apache.org/jira/browse/SPARK-7768. You can watch that JIRA (and vote). -Xiangrui On Wed, May 20, 2015 at 11:03 AM, Justin Uang justin.u...@gmail.com wrote: Xiangrui, is there a timeline for when UDTs will become a public API? I'm currently

Re: PySpark Logs location

2015-05-20 Thread Ruslan Dautkhanov
Oleg, You can see applicationId in your Spark History Server. Go to http://historyserver:18088/ Also check https://spark.apache.org/docs/1.1.0/running-on-yarn.html#debugging-your-application It should be no different with PySpark. -- Ruslan Dautkhanov On Wed, May 20, 2015 at 2:12 PM, Oleg

Storing data in MySQL from spark hive tables

2015-05-20 Thread roni
Hi, I am trying to set up the hive metastore and MySQL DB connection. I have a Spark cluster; I ran some programs and I have data stored in some hive tables. Now I want to store this data into MySQL so that it is available for further processing. I set up the hive-site.xml file. <?xml

Re: User Defined Type (UDT)

2015-05-20 Thread Justin Uang
Xiangrui, is there a timeline for when UDTs will become a public API? I'm currently using them to support java 8's ZonedDateTime. On Tue, May 19, 2015 at 3:14 PM Xiangrui Meng men...@gmail.com wrote: (Note that UDT is not a public API yet.) On Thu, May 7, 2015 at 7:11 AM, wjur

Re: Spark users

2015-05-20 Thread Akhil Das
Yes, this is the user group. Feel free to ask your questions in this list. Thanks Best Regards On Wed, May 20, 2015 at 5:58 AM, Ricardo Goncalves da Silva ricardog.si...@telefonica.com wrote: Hi I'm learning spark focused on data and machine learning. Migrating from SAS. There is a group

Fwd: Re: spark 1.3.1 jars in repo1.maven.org

2015-05-20 Thread Edward Sargisson
Hi Sean and Ted, Thanks for your replies. I don't have our current problems nicely written up as good questions yet. I'm still sorting out classpath issues, etc. In case it is of help, I'm seeing: * Exception in thread "Spark Context Cleaner" java.lang.NoClassDefFoundError: 0 at

Re: --jars works in yarn-client but not yarn-cluster mode, why?

2015-05-20 Thread Marcelo Vanzin
Hello, Sorry for the delay. The issue you're running into is because most HBase classes are in the system class path, while jars added with --jars are only visible to the application class loader created by Spark. So classes in the system class path cannot see them. You can work around this by
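The archived message is truncated before the workaround itself; one approach consistent with this diagnosis (the paths are placeholders, and it may not be exactly what Marcelo goes on to describe) is to put the HBase jars on the driver and executor extra class paths so the system class loader can see them:

    spark-submit --master yarn-cluster \
      --conf spark.driver.extraClassPath=/opt/hbase/lib/* \
      --conf spark.executor.extraClassPath=/opt/hbase/lib/* \
      myapp.jar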

Re: PySpark Logs location

2015-05-20 Thread Ruslan Dautkhanov
You could use yarn logs -applicationId application_1383601692319_0008 -- Ruslan Dautkhanov On Wed, May 20, 2015 at 5:37 AM, Oleg Ruchovets oruchov...@gmail.com wrote: Hi , I am executing PySpark job on yarn ( hortonworks distribution). Could someone pointing me where is the log

IPv6 support

2015-05-20 Thread Kevin Liu
Hello, I have to work with IPv6-only servers and when I installed the 1.3.1 hadoop 2.6 build, I couldn't get the example to run due to IPv6 issues (errors below). I tried to add the -Djava.net.preferIPv6Addresses=true setting but it still doesn't work. A search on Spark's support for IPv6 is

Re: GradientBoostedTrees.trainRegressor with categoricalFeaturesInfo

2015-05-20 Thread Burak Yavuz
Could you please open a JIRA for it? The maxBins input is missing for the Python API. Is it possible for you to use the current master? In the current master, you should be able to use trees with the Pipeline API and DataFrames. Best, Burak On Wed, May 20, 2015 at 2:44 PM, Don Drake

Re: Storing data in MySQL from spark hive tables

2015-05-20 Thread Yana Kadiyska
I'm afraid you misunderstand the purpose of hive-site.xml. It configures access to the Hive metastore. You can read more here: http://www.hadoopmaterial.com/2013/11/metastore.html. So the MySQL DB in hive-site.xml would be used to store hive-specific data such as schema info, partition info, etc.

Re: Compare LogisticRegression results using Mllib with those using other libraries (e.g. statsmodel)

2015-05-20 Thread DB Tsai
Hi Xin, If you take a look at the model you trained, the intercept from Spark is significantly smaller than StatsModel, and the intercept represents a prior on categories in LOR which causes the low accuracy in Spark implementation. In LogisticRegressionWithLBFGS, the intercept is regularized due

Spatial function in spark

2015-05-20 Thread developer developer
Hello, I am fairly new to Spark and Python programming. I have an RDD with polygons, and I need to perform spatial joins, geohash calculations and other spatial operations on these RDDs in parallel. I run Spark jobs on a yarn cluster, and develop Spark applications in Python. So, can you please

Spark Application Dependency Issue

2015-05-20 Thread Snehal Nagmote
Hi All, I am on Spark 1.1 with Datastax DSE. The application is Spark Streaming and has Couchbase dependencies, which use http-core 4.3.2. While running the application I get this error: NoSuchMethodError: org.apache.http.protocol.RequestUserAgent.<init>(Ljava/lang/String;)V

Compare LogisticRegression results using Mllib with those using other libraries (e.g. statsmodel)

2015-05-20 Thread Xin Liu
Hi, I have tried a few models in Mllib to train a LogisticRegression model. However, I consistently get much better results using other libraries such as statsmodel (which gives similar results as R) in terms of AUC. For illustration purpose, I used a small data (I have tried much bigger data)

How to process data in chronological order

2015-05-20 Thread roy
I have a key-value RDD, key is a timestamp (femto-second resolution, so grouping buys me nothing) and I want to reduce it in the chronological order. How do I do that in spark? I am fine with reducing contiguous sections of the set separately and then aggregating the resulting objects locally.

Re: GradientBoostedTrees.trainRegressor with categoricalFeaturesInfo

2015-05-20 Thread Joseph Bradley
One more comment: That's a lot of categories for a feature. If it makes sense for your data, it will run faster if you can group the categories or split the 1895 categories into a few features which have fewer categories. On Wed, May 20, 2015 at 3:17 PM, Burak Yavuz brk...@gmail.com wrote:

Re: Spark 1.3.1 - SQL Issues

2015-05-20 Thread ayan guha
Thanks a bunch On 21 May 2015 07:11, Davies Liu dav...@databricks.com wrote: The docs had been updated. You should convert the DataFrame to RDD by `df.rdd` On Mon, Apr 20, 2015 at 5:23 AM, ayan guha guha.a...@gmail.com wrote: Hi Just upgraded to Spark 1.3.1. I am getting an warning

Re: Spark Job not using all nodes in cluster

2015-05-20 Thread Shailesh Birari
No, I am not setting the number of executors anywhere (in the env file or in the program). Is it due to a large number of small files? On Wed, May 20, 2015 at 5:11 PM, ayan guha guha.a...@gmail.com wrote: What does your spark env file say? Are you setting the number of executors in the spark context? On 20

Re: Compare LogisticRegression results using Mllib with those using other libraries (e.g. statsmodel)

2015-05-20 Thread Joseph Bradley
Hi Xin, 2 suggestions: 1) Feature scaling: spark.mllib's LogisticRegressionWithLBFGS uses feature scaling, which scales feature values to have unit standard deviation. That improves optimization behavior, and it often improves statistical estimation (though maybe not for your dataset).
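To test whether scaling alone explains the gap, a small spark.mllib sketch (assuming `points` is an RDD of LabeledPoint):

    from pyspark.mllib.feature import StandardScaler
    from pyspark.mllib.regression import LabeledPoint

    # scale features to unit standard deviation, mirroring what
    # LogisticRegressionWithLBFGS does internally
    features = points.map(lambda p: p.features)
    scaler = StandardScaler(withMean=False, withStd=True).fit(features)
    scaled = (points.map(lambda p: p.label)
                    .zip(scaler.transform(features))
                    .map(lambda lf: LabeledPoint(lf[0], lf[1])))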

Cannot submit SparkPi to Standalone (1.3.1) running on another Server (Both Linux)

2015-05-20 Thread Carey Sublette
I am attempting to submit a job (using SparkPi) from one Linux machine (Ubuntu 14.04) to Spark 1.3.1 running in standalone mode on another Linux machine (Xubuntu 12.04; spartacus.servile.war), but I cannot make a connection. I have investigated everything I can think of to diagnose/fix the

Help needed with Py4J

2015-05-20 Thread Addanki, Santosh Kumar
Hi Colleagues, We need to call a Scala class from PySpark in an IPython notebook. We tried something like the below:

    from py4j.java_gateway import java_import
    java_import(sparkContext._jvm, 'mynamespace')
    myScalaClass = sparkContext._jvm.SimpleScalaClass()
    myScalaClass.sayHello("World")

Works Fine

Re: --jars works in yarn-client but not yarn-cluster mode, why?

2015-05-20 Thread Fengyun RAO
Thank you so much, Marcelo! It WORKS! 2015-05-21 2:05 GMT+08:00 Marcelo Vanzin van...@cloudera.com: Hello, Sorry for the delay. The issue you're running into is because most HBase classes are in the system class path, while jars added with --jars are only visible to the application class

Spark build with Hive

2015-05-20 Thread guoqing0...@yahoo.com.hk
Hi, can Spark-1.3.1 be built with Hive-1.2? It seems Spark-1.3.1 can only be built with Hive 0.13 or 0.12, according to the documentation.

    # Apache Hadoop 2.4.X with Hive 13 support
    mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -Phive -Phive-thriftserver -DskipTests clean package

    # Apache Hadoop

Re: Compare LogisticRegression results using Mllib with those using other libraries (e.g. statsmodel)

2015-05-20 Thread Chris Gore
I tried running this data set as described with my own implementation of L2-regularized logistic regression using LBFGS to compare: https://github.com/cdgore/fitbox Intercept: -0.886745823033 Weights (['gre', 'gpa', 'rank']): [ 0.28862268 0.19402388 -0.36637964]

Re: Help needed with Py4J

2015-05-20 Thread Addanki, Santosh Kumar
Yeah ... I am able to instantiate the simple scala class as explained below, which is from the same JAR. Regards Santosh On May 20, 2015, at 7:26 PM, Holden Karau hol...@pigscanfly.ca wrote: Are your jars included in both the driver and worker class paths? On

Re: Spark Streaming graceful shutdown in Spark 1.4

2015-05-20 Thread Tathagata Das
If you are talking about handling driver crash failures, then all bets are off anyways! Adding a shutdown hook in the hope of handling driver process failure handles only some cases (Ctrl-C), but does not handle cases like SIGKILL (which does not run JVM shutdown hooks) or a driver machine crash. So

Re: GradientBoostedTrees.trainRegressor with categoricalFeaturesInfo

2015-05-20 Thread Don Drake
JIRA created: https://issues.apache.org/jira/browse/SPARK-7781 Joseph, I agree, I'm debating removing this feature altogether, but I'm putting the model through its paces. Thanks. -Don On Wed, May 20, 2015 at 7:52 PM, Joseph Bradley jos...@databricks.com wrote: One more comment: That's a lot

Re: RE: Spark build with Hive

2015-05-20 Thread guoqing0...@yahoo.com.hk
Thanks very much. Which versions will be supported in the upcoming 1.4? I hope it will support more versions. guoqing0...@yahoo.com.hk From: Cheng, Hao Date: 2015-05-21 11:20 To: Ted Yu; guoqing0...@yahoo.com.hk CC: user Subject: RE: Spark build with Hive Yes, ONLY support 0.12.0 and 0.13.1

Re: [Unit Test Failure] Test org.apache.spark.streaming.JavaAPISuite.testCount failed

2015-05-20 Thread Tathagata Das
Has this been fixed for you now? There have been a number of patches since then and it may have been fixed. On Thu, May 14, 2015 at 7:20 AM, Wangfei (X) wangf...@huawei.com wrote: Yes, it fails repeatedly on my local Jenkins. Sent from my iPhone. On May 14, 2015, at 18:30, Tathagata Das t...@databricks.com

Re: Help needed with Py4J

2015-05-20 Thread Holden Karau
Ah sorry, I missed that part (I've been dealing with some py4j stuff today as well and maybe skimmed it a bit too quickly). Do you have your code somewhere I could take a look at? Also does your constructor expect a JavaSparkContext or a regular SparkContext (if you look at how the SQLContext is
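Following that SQLContext pattern, a hedged sketch of handing the underlying Scala SparkContext to a JVM-side constructor (the class taking a context is hypothetical; only SimpleScalaClass appears in this thread):

    # _jsc is the JavaSparkContext wrapper; _jsc.sc() unwraps it to the
    # Scala SparkContext, which is what SQLContext passes through py4j
    scala_sc = sparkContext._jsc.sc()
    obj = sparkContext._jvm.mynamespace.MyClassWithContext(scala_sc)  # hypothetical class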

RE: RE: Spark build with Hive

2015-05-20 Thread Wang, Daoyuan
In 1.4 I think we still only support 0.12.0 and 0.13.1. From: guoqing0...@yahoo.com.hk [mailto:guoqing0...@yahoo.com.hk] Sent: Thursday, May 21, 2015 12:03 PM To: Cheng, Hao; Ted Yu Cc: user Subject: Re: RE: Spark build with Hive Thanks very much , Which version will be support In the upcome 1.4

Re: Spark Streaming graceful shutdown in Spark 1.4

2015-05-20 Thread Dibyendu Bhattacharya
Thanks Tathagata for making this change.. Dibyendu On Thu, May 21, 2015 at 8:24 AM, Tathagata Das t...@databricks.com wrote: If you are talking about handling driver crash failures, then all bets are off anyways! Adding a shutdown hook in the hope of handling driver process failure, handles

Re: Help needed with Py4J

2015-05-20 Thread Holden Karau
Are your jars included in both the driver and worker class paths? On Wednesday, May 20, 2015, Addanki, Santosh Kumar santosh.kumar.adda...@sap.com wrote: Hi Colleagues We need to call a Scala Class from pySpark in Ipython notebook. We tried something like below : from

Re: Spark build with Hive

2015-05-20 Thread Ted Yu
I am afraid even Hive 1.0 is not supported, let alone Hive 1.2 Cheers On Wed, May 20, 2015 at 8:08 PM, guoqing0...@yahoo.com.hk guoqing0...@yahoo.com.hk wrote: Hi , is the Spark-1.3.1 can build with the Hive-1.2 ? it seem to Spark-1.3.1 can only build with 0.13 , 0.12 according to the

Storing spark processed output to Database asynchronously.

2015-05-20 Thread Gautam Bajaj
Hi, From my understanding of Spark Streaming, I created a Spark entry point for continuous UDP data, using:

    SparkConf conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount");
    JavaStreamingContext jssc = new JavaStreamingContext(conf, new
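On the storing side of the question, a common pattern (sketched in PySpark rather than the poster's Java; the database client is a placeholder) is to write each partition from within foreachRDD, so one connection serves many records:

    def save_partition(records):
        conn = db_connect("db-host")  # hypothetical connection helper
        for r in records:
            conn.insert(r)            # hypothetical insert call
        conn.close()

    # counts is assumed to be the DStream produced by the pipeline
    counts.foreachRDD(lambda rdd: rdd.foreachPartition(save_partition))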

View all user's application logs in history server

2015-05-20 Thread Jianshi Huang
Hi, I'm using Spark 1.4.0-rc1 with the default settings for the history server, but I can only see my own logs. Is it possible to view all users' logs? The permission is fine for the user group. -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github Blog: http://huangjs.github.com/

Re: Mesos Spark Tasks - Lost

2015-05-20 Thread Panagiotis Garefalakis
Tim, thanks for your reply. I am following this quite clear Mesos-Spark tutorial: https://docs.mesosphere.com/tutorials/run-spark-on-mesos/ So mainly I tried running spark-shell, which works fine locally, but when the jobs are submitted through Mesos something goes wrong! My question is: is there a

Re: Intermittent difficulties for Worker to contact Master on same machine in standalone

2015-05-20 Thread Yana Kadiyska
But if I'm reading his email correctly he's saying that: 1. The master and slave are on the same box (so network hiccups are an unlikely culprit) 2. The failures are intermittent -- i.e. the program works for a while, then the worker gets disassociated... Is it possible that the master restarted? We used to