Re: Why so many parquet file part when I store data in Alluxio or File?

2016-06-30 Thread Ted Yu
Looking under the Alluxio source, it seems only "fs.hdfs.impl.disable.cache" is in use. FYI On Thu, Jun 30, 2016 at 9:30 PM, Deepak Sharma wrote: > Ok. > I came across this issue. > Not sure if you already assessed this: >

Re: Why so many parquet file part when I store data in Alluxio or File?

2016-06-30 Thread Deepak Sharma
Ok. I came across this issue. Not sure if you already assessed this: https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-6921 The workaround mentioned may work for you. Thanks, Deepak On 1 Jul 2016 9:34 am, "Chanh Le" wrote: > Hi Deepark, > Thank for

Re: Why so many parquet file part when I store data in Alluxio or File?

2016-06-30 Thread Chanh Le
Hi Deepak, Thanks for replying. The way to write into Alluxio is df.write.mode(SaveMode.Append).partitionBy("network_id", "time").parquet("alluxio://master1:1/FACT_ADMIN_HOURLY") I partition by 2 columns and store. I just want it to automatically write files of a proper size for what I
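
A minimal sketch of one way to limit how many part files this write produces, assuming the DataFrame df, the partition columns, and the Alluxio path from the message above; the coalesce value of 8 is purely an illustrative assumption and is not necessarily the SPARK-6921 workaround mentioned elsewhere in the thread:

    import org.apache.spark.sql.SaveMode

    // Fewer write tasks generally means fewer parquet part files per
    // partition directory. The value 8 is an arbitrary example.
    df.coalesce(8)
      .write
      .mode(SaveMode.Append)
      .partitionBy("network_id", "time")
      .parquet("alluxio://master1:1/FACT_ADMIN_HOURLY")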

HiveContext

2016-06-30 Thread manish jaiswal
-- Forwarded message -- From: "manish jaiswal" Date: Jun 30, 2016 17:35 Subject: HiveContext To: , , <user-h...@spark.apache.org> Cc: Hi, I am new to Spark. I found using HiveContext we can connect

Re: One map per folder in spark or Hadoop

2016-06-30 Thread Balachandar R.A.
Thank you very much. I will try this code and update you. Regards, Bala On 01-Jul-2016 7:46 am, "Sun Rui" wrote: > Say you have got all of your folder paths into a val folders: Seq[String] > > val add = sc.parallelize(folders, folders.size).mapPartitions { iter => > val

Re: One map per folder in spark or Hadoop

2016-06-30 Thread Sun Rui
Say you have got all of your folder paths into a val folders: Seq[String] val add = sc.parallelize(folders, folders.size).mapPartitions { iter => val folder = iter.next val status: Int = Seq(status).toIterator } > On Jun 30, 2016, at 16:42, Balachandar R.A.
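
A self-contained sketch of the pattern Sun Rui outlines, assuming sc is the SparkContext, folders: Seq[String] holds the folder paths, and "/path/to/executable" stands in for the black-box program that processes one folder:

    import scala.sys.process._

    val results = sc.parallelize(folders, folders.size).mapPartitions { iter =>
      val folder = iter.next()
      // Run the external executable on this folder and capture its exit code.
      val status: Int = Seq("/path/to/executable", folder).!
      Iterator(folder -> status)
    }
    // Collect the per-folder exit codes back to the driver.
    results.collect().foreach { case (folder, status) =>
      println(s"$folder finished with exit code $status")
    }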

Re: Looking for help about stackoverflow in spark

2016-06-30 Thread Chanh Le
Hi John, I think it relates to driver memory more than the other things you said. Can you just increase the memory for the driver? > On Jul 1, 2016, at 9:03 AM, johnzeng wrote: > > I am trying to load a 1 TB collection into spark cluster from mongo. But I am > keep getting

Looking for help about stackoverflow in spark

2016-06-30 Thread johnzeng
I am trying to load a 1 TB collection into the Spark cluster from Mongo, but I keep getting a stack overflow error after running for a while. I have posted a question on stackoverflow.com and tried all the advice provided there, but nothing works... how to load large database into spark

RE: Spark 2.0 Continuous Processing

2016-06-30 Thread kmat
In a continuous processing pipeline with dataframes, is there any way to checkpoint the processing state (by the user) at periodic intervals? The thought process behind this is to rewind to any particular checkpoint and then fast-forward processing from there. Date: Wed, 29 Jun 2016 23:17:47 -0700

RE: Spark jobs

2016-06-30 Thread Joaquin Alzola
Hi Sujeet, I think that might not work. Running this: #!/usr/bin/env python3 from pyspark_cassandra import CassandraSparkContext, Row from pyspark import SparkContext, SparkConf from pyspark.sql import SQLContext conf =

RE: Remote RPC client disassociated

2016-06-30 Thread Joaquin Alzola
>>> 16/06/30 10:44:34 ERROR util.Utils: Uncaught exception in thread stdout >>> writer for python java.lang.AbstractMethodError: pyspark_cassandra.DeferringRowReader.read(Lcom/datastax/driver/core/Row;Lcom/datastax/spark/connector/CassandraRowMetadata;)Ljava/lang/Object; >> You are trying to

Re: Logical Plan

2016-06-30 Thread Mich Talebzadeh
I don't think the Spark optimizer supports something like a statement cache, where the plan is cached and bind variables (as in an RDBMS) are used for different values, thus saving the parsing. What you're stating is that the source and tempTable change but the plan itself remains the same. I have not seen this

Re: Logical Plan

2016-06-30 Thread Darshan Singh
I am using 1.5.2. I have a data frame with 10 columns, and then I pivot 1 column and generate 700 columns. It is like val df1 = sqlContext.read.parquet("file1") df1.registerTempTable("df1") val df2= sqlContext.sql("select col1, col2, sum(case when col3 = 1 then col4 else 0.0 end ) as
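
A small sketch of the case-when pivot being described, assuming sqlContext and a parquet file "file1"; col1..col4 follow the snippet above, and pivoting on just the values 1 and 2 stands in for the ~700 generated columns:

    val df1 = sqlContext.read.parquet("file1")
    df1.registerTempTable("df1")
    // Each distinct value of col3 becomes one conditional-sum column.
    val df2 = sqlContext.sql(
      """select col1, col2,
        |       sum(case when col3 = 1 then col4 else 0.0 end) as col3_1_sum,
        |       sum(case when col3 = 2 then col4 else 0.0 end) as col3_2_sum
        |from df1
        |group by col1, col2""".stripMargin)
    df2.show()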

Re: Call Scala API from PySpark

2016-06-30 Thread Pedro Rodriguez
That was indeed the case; using UTF8Deserializer makes everything work correctly. Thanks for the tips! On Thu, Jun 30, 2016 at 3:32 PM, Pedro Rodriguez wrote: > Quick update, I was able to get most of the plumbing to work thanks to the > code Holden posted and browsing

Re: Call Scala API from PySpark

2016-06-30 Thread Pedro Rodriguez
Quick update, I was able to get most of the plumbing to work thanks to the code Holden posted and browsing more source code. I am running into an error which makes me think that maybe I shouldn't be leaving the default Python RDD serializer/pickler in place and should do something else

Re: Logical Plan

2016-06-30 Thread Mich Talebzadeh
A logical plan should not change, assuming the same DAG diagram is used throughout. Have you tried the Spark GUI page under Stages? This is a Spark 2 example: HTH Dr Mich Talebzadeh LinkedIn *

Re: Logical Plan

2016-06-30 Thread Reynold Xin
Which version are you using here? If the underlying files change, technically we should go through optimization again. Perhaps the real "fix" is to figure out why logical plan creation is so slow for 700 columns. On Thu, Jun 30, 2016 at 1:58 PM, Darshan Singh wrote: >

Logical Plan

2016-06-30 Thread Darshan Singh
Is there a way I can use the same logical plan for a query? Everything will be the same except the underlying file will be different. The issue is that my query has around 700 columns and generating the logical plan takes 20 seconds; this happens every 2 minutes, but each time the underlying file is different. I do

RDD to DataFrame question with JsValue in the mix

2016-06-30 Thread Dood
Hello, I have an RDD[(String,JsValue)] that I want to convert into a DataFrame and then run SQL on. What is the easiest way to get the JSON (in the form of JsValue) "understood" by the process? Thanks!
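
One possible approach, sketched under the assumption that JsValue is play-json's type and that an inferred schema is acceptable: serialize each value back to a JSON string and let Spark SQL infer the schema. sqlContext and the input rdd are assumed to be in scope, and the key is dropped for simplicity; "events" is just an example table name:

    import play.api.libs.json.{JsValue, Json}

    val jsonStrings = rdd.map { case (_, js) => Json.stringify(js) }
    // Spark SQL infers the schema by sampling the JSON text.
    val df = sqlContext.read.json(jsonStrings)
    df.registerTempTable("events")
    sqlContext.sql("select * from events limit 10").show()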

Re: Call Scala API from PySpark

2016-06-30 Thread Pedro Rodriguez
Thanks Jeff and Holden, A little more context here probably helps. I am working on implementing the idea from this article to make reads from S3 faster: http://tech.kinja.com/how-not-to-pull-from-s3-using-apache-spark-1704509219 (although my name is Pedro, I am not the author of the article). The

Re: Call Scala API from PySpark

2016-06-30 Thread Holden Karau
So I'm a little biased - I think the best bridge between the two is using DataFrames. I've got some examples in my talk and on the high performance spark GitHub https://github.com/high-performance-spark/high-performance-spark-examples/blob/master/high_performance_pyspark/simple_perf_test.py calls

How to spin up Kafka using docker and use for Spark Streaming Integration tests

2016-06-30 Thread SRK
Hi, I need to do integration tests using Spark Streaming. My idea is to spin up Kafka using Docker locally and use it to feed the stream to my Streaming Job. Any suggestions on how to do this would be of great help. Thanks, Swetha

Re: Remote RPC client disassociated

2016-06-30 Thread Jeff Zhang
>>> 16/06/30 10:44:34 ERROR util.Utils: Uncaught exception in thread stdout writer for python java.lang.AbstractMethodError: pyspark_cassandra.DeferringRowReader.read(Lcom/datastax/driver/core/Row;Lcom/datastax/spark/connector/CassandraRowMetadata;)Ljava/lang/Object; You are trying to call an

Re: Call Scala API from PySpark

2016-06-30 Thread Jeff Zhang
Hi Pedro, Your use case is interesting. I think launching the Java gateway is the same as for the native SparkContext; the only difference is creating your custom SparkContext instead of the native SparkContext. You might also need to wrap it using Java.

Re: Can Spark Dataframes preserve order when joining?

2016-06-30 Thread Takeshi Yamamuro
Hi, Most join strategies do not preserve the ordering of the input dfs (a sort-merge join only holds the ordering of the left input df). So, as said earlier, you need to explicitly sort them if you want ordered outputs. // maropu On Wed, Jun 29, 2016 at 3:38 PM, Mich Talebzadeh
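
A tiny sketch of the explicit sort this reply recommends; the DataFrames left and right and the columns "id" and "ts" are hypothetical names used only for illustration:

    // Join first, then impose the ordering you need on the result.
    val joined = left.join(right, left("id") === right("id"))
    val ordered = joined.orderBy(left("id"), left("ts"))
    ordered.show()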

Call Scala API from PySpark

2016-06-30 Thread Pedro Rodriguez
Hi All, I have written a Scala package which essentially wraps the SparkContext around a custom class that adds some functionality specific to our internal use case. I am trying to figure out the best way to call this from PySpark. I would like to do this similarly to how Spark itself calls the

Re: Possible to broadcast a function?

2016-06-30 Thread Aaron Perrin
That's helpful, thanks. I didn't see that thread earlier. But it sounds like the best solution is to use singletons in the executors, which I'm already doing. (BTW - the reason I consider that method kind of hack-ish is that it makes the code a bit more difficult for others to

Re: how to add a column according to an existing column of a dataframe?

2016-06-30 Thread nguyen duc tuan
About the Spark issue that you refer to, it is not related to your problem :D In this case, all you have to do is use the withColumn function. For example: import org.apache.spark.sql.functions._ val getRange = udf((x: Int) => get price range code ...) val priceRange = resultPrice.withColumn("range",
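
A fleshed-out sketch of the withColumn/udf suggestion above; the bucket boundaries (100 and 500) and labels are made up purely for illustration, and resultPrice is the questioner's dataframe, whose price column is a string and is therefore cast to double first:

    import org.apache.spark.sql.functions._

    val getRange = udf { (price: Double) =>
      if (price < 100) "low"
      else if (price < 500) "medium"
      else "high"
    }
    val priceRange = resultPrice.withColumn("range", getRange(col("price").cast("double")))
    priceRange.show()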

how to add a column according to an existing column of a dataframe?

2016-06-30 Thread luohui20001
Hi guys, I have a dataframe with 3 columns, id (int), type (string), price (string), and I want to add a column "price range" according to the value of price. I checked SPARK-15383; however, in my code I just want to append a column that is transformed from the original dataframe

RE: Possible to broadcast a function?

2016-06-30 Thread Yong Zhang
How about this old discussion related to a problem similar to yours: http://apache-spark-user-list.1001560.n3.nabble.com/Running-a-task-once-on-each-executor-td3203.html Yong From: aper...@timerazor.com Date: Wed, 29 Jun 2016 14:00:07 + Subject: Possible to broadcast a function? To:

Re: Error report file is deleted automatically after spark application finished

2016-06-30 Thread dhruve ashar
There could be multiple reasons why it's not being generated even after setting the ulimit appropriately. Try out the options listed in this thread: http://stackoverflow.com/questions/7732983/core-dump-file-is-not-generated On Thu, Jun 30, 2016 at 2:25 AM, prateek arora

Remote RPC client disassociated

2016-06-30 Thread Joaquin Alzola
Hi list, I am launching this spark-submit job: hadoop@testbedocg:/mnt/spark> bin/spark-submit --packages com.datastax.spark:spark-cassandra-connector_2.10:1.6.0 --jars /mnt/spark/lib/TargetHolding_pyspark-cassandra-0.3.5.jar spark_v2.py spark_v2.py is: from pyspark_cassandra import

Re: Spark master shuts down when one of zookeeper dies

2016-06-30 Thread Ted Yu
Looking at Master.scala, I don't see code that would bring the master back up automatically. You could probably implement a monitoring tool so that you get an alert when the master goes down, e.g. http://stackoverflow.com/questions/12896998/how-to-set-up-alerts-on-ganglia More experienced users may have

Re: Spark master shuts down when one of zookeeper dies

2016-06-30 Thread vimal dinakaran
Hi Ted, Thanks for the pointers. I had a three-node zookeeper setup. Now only the master dies when a zookeeper instance is down; a new master is elected as leader and the cluster stays up. But the master that went down never comes back up. Is this expected? Is there a way to get an alert when a

One map per folder in spark or Hadoop

2016-06-30 Thread Balachandar R.A.
Hello, I have some 100 folders. Each folder contains 5 files. I have an executable that processes one folder. The executable is a black box and hence cannot be modified. I would like to process the 100 folders in parallel using Apache Spark so that I can spawn one map task per folder. Can

Re: How to use scala.tools.nsc.interpreter.IMain in Spark, just like calling eval in Perl.

2016-06-30 Thread Jayant Shekhar
Hi Fanchao, This is because it is unable to find the anonymous classes generated. Adding the code below worked for me. I found the details here: https://github.com/cloudera/livy/blob/master/repl/src/main/scala/com/cloudera/livy/repl/SparkInterpreter.scala // Spark 1.6 does not have

Re: Set the node the spark driver will be started

2016-06-30 Thread Felix Massem
Hey Bryan, yes this definitely sounds like the issue I have :-) Thx a lot and best regards Felix Felix Massem | IT-Consultant | Karlsruhe mobil: +49 (0) 172.2919848 www.codecentric.de | blog.codecentric.de | www.meettheexperts.de

Re: deploy-mode flag in spark-sql cli

2016-06-30 Thread Huang Meilong
Thank you, I got it. From: Mich Talebzadeh Sent: June 30, 2016 14:52 To: Saisai Shao Cc: Huang Meilong; user@spark.apache.org Subject: Re: deploy-mode flag in spark-sql cli Yes I forgot that anything with REPL both spark-sql and spark-shell

How to use scala.tools.nsc.interpreter.IMain in Spark, just like calling eval in Perl.

2016-06-30 Thread Fanchao Meng
Hi Spark Community, I am trying to dynamically interpret code given as a String in Spark, just like calling eval in the Perl language. However, I ran into a problem when running the program. I would really appreciate your help. **Requirement:** The requirement is to make the spark processing chain

Change spark dataframe to LabeledPoint in Java

2016-06-30 Thread Abhishek Anand
Hi, I have a dataframe which I want to convert to LabeledPoint. DataFrame labeleddf = model.transform(newdf).select("label","features"); How can I convert this to a LabeledPoint to use in my Logistic Regression model? I could do this in Scala using val trainData = labeleddf.map(row =>
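
A sketch completing the Scala snippet quoted above (the original question asks about Java; this is only the Scala side), assuming the "features" column holds an org.apache.spark.mllib.linalg.Vector and "label" is a double:

    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.mllib.regression.LabeledPoint

    // Map each Row of the (label, features) dataframe to an MLlib LabeledPoint.
    val trainData = labeleddf.map { row =>
      LabeledPoint(row.getAs[Double]("label"), row.getAs[Vector]("features"))
    }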

Re: Error report file is deleted automatically after spark application finished

2016-06-30 Thread prateek arora
Thanks for the information. My problem is resolved now. I have one more issue: I am not able to save the core dump file. It always shows *"# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again"* I set core dump

Re: deploy-mode flag in spark-sql cli

2016-06-30 Thread Mich Talebzadeh
Yes, I forgot that anything with a REPL (both spark-sql and spark-shell) is a simple convenience interface. Thanks Saisai for pointing that out. Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

Re: Using R code as part of a Spark Application

2016-06-30 Thread Sun Rui
I would guess that the technology behind Azure R Server is Revolution Enterprise DistributedR/ScaleR. I don't know the details, but see the statement in the "Step 6. Install R packages" section of the given documentation page. However, if you need to install R packages on the worker nodes

Re: Using R code as part of a Spark Application

2016-06-30 Thread sujeet jog
Thanks for the link, Sun. I believe running external scripts like R code on DataFrames is a much-needed facility. For example, for algorithms that are not available in MLlib, invoking them from an R script would definitely be a powerful feature when your app is Scala/Python based; you don't