Re: Custom Spark Error on Hadoop Cluster

2016-07-07 Thread Xiangrui Meng
This seems like a deployment or dependency issue. Please check the following: 1. That the unmodified Spark jars are not on the classpath (already present on the cluster or pulled in by other packages). 2. That the modified jars were indeed deployed to both master and slave nodes. On Tue, Jul 5, 2016 at

Re: Spark as sql engine on S3

2016-07-07 Thread ayan guha
Yes, it can. On Fri, Jul 8, 2016 at 3:03 PM, Ashok Kumar wrote: > thanks so basically Spark Thrift Server runs on a port much like beeline > that uses JDBC to connect to Hive? > > Can Spark thrift server access Hive tables? > > regards > > > On Friday, 8 July 2016, 5:27,

Re: Spark as sql engine on S3

2016-07-07 Thread Ashok Kumar
thanks so basically Spark Thrift Server runs on a port much like beeline that uses JDBC to connect to Hive? Can Spark thrift server access Hive tables? regards On Friday, 8 July 2016, 5:27, ayan guha wrote: Spark Thrift Server..works as jdbc server. you can

Re: Any ways to connect BI tool to Spark without Hive

2016-07-07 Thread ayan guha
Yes, absolutely. You need to "save" the table using the saveAsTable function. The underlying storage is HDFS or any other storage, and you are basically using Spark's embedded Hive services (when you do not have Hadoop in the setup). Think of STS as a JDBC server in front of your datastore. STS runs as a
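
A minimal sketch of that suggestion (assuming Spark 1.6 with a HiveContext; the table name is hypothetical):

    // persist the dataframe as a table that the Spark Thrift Server can expose
    df.write.mode("overwrite").saveAsTable("sales_summary")
    // any JDBC client (beeline, SQuirreL, a BI tool) pointed at STS can then run:
    //   SELECT * FROM sales_summary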

Re: Any ways to connect BI tool to Spark without Hive

2016-07-07 Thread Mich Talebzadeh
Tableau has its own Tableau Server, which stores prepared reports, not data. It does not cache data. What users do is access Tableau Server from their Tableau client and use the reports. You still need to get the data out from the persistent store. I have not heard of Tableau having its own storage layer.

Re: Compute pairwise distance

2016-07-07 Thread Debasish Das
Hi Manoj, I have a Spark meetup talk that explains the issues with DIMSUM when you have to calculate row similarities. You can still use the PR since it has all the code you need, but I have not had time to refactor it for the merge. I believe a few kernels are supported as well. Thanks. Deb On

Re: Any ways to connect BI tool to Spark without Hive

2016-07-07 Thread Chanh Le
Hi Ayan, Thanks for replying. It sounds great. Let me check. One thing that confuses me: is there any way to share things between the two? I mean Zeppelin and the Thrift Server. For example: I update or insert data into a table from Zeppelin, and an external tool connected through STS can see it. Thanks & regards, Chanh

Re: Any ways to connect BI tool to Spark without Hive

2016-07-07 Thread Chanh Le
Hi Mich, Actually, technical users may write some fairly complex machine learning things in the future too, which is why Zeppelin is promising. > Those business users: do they use Oracle BI (OBI) to connect to a DW like Oracle > now? Yes, they do. Our data is still stored in Oracle but it's

Re: Spark as sql engine on S3

2016-07-07 Thread ayan guha
Spark Thrift Server works as a JDBC server; you can connect to it from any JDBC tool like SQuirreL. On Fri, Jul 8, 2016 at 3:50 AM, Ashok Kumar wrote: > Hello gurus, > > We are storing data externally on Amazon S3 > > What is the optimum or best way to use Spark
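
For illustration, a bare-bones JDBC connection to STS from Scala might look like this (a sketch; the host name, the default port 10000 and the empty credentials are assumptions for your environment):

    import java.sql.DriverManager

    Class.forName("org.apache.hive.jdbc.HiveDriver")   // Hive JDBC driver, also used for STS
    val conn = DriverManager.getConnection("jdbc:hive2://sts-host:10000/default", "user", "")
    val rs = conn.createStatement().executeQuery("show tables")
    while (rs.next()) println(rs.getString(1))
    conn.close()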

Re: Any ways to connect BI tool to Spark without Hive

2016-07-07 Thread ayan guha
Hi, Spark Thrift Server does not need Hive/Hadoop. STS should be your first choice if you are planning to integrate BI tools with Spark. It works with Zeppelin as well. We do all our development using Zeppelin and STS. One thing to note: many BI tools like QlikSense and Tableau (not sure about the Oracle BI tool)

Re: Any ways to connect BI tool to Spark without Hive

2016-07-07 Thread Mich Talebzadeh
Interesting, Chanh. Those business users: do they use Oracle BI (OBI) to connect to a DW like Oracle now? Certainly power users can use Zeppelin to write code that will be executed through Spark, but I much doubt Zeppelin can do what the OBI tool provides. What you need is to investigate whether the OBI tool can

RE: Processing json document

2016-07-07 Thread Hyukjin Kwon
Yeah, I totally agree with Yong. Anyway, this might not be a great idea, but you might want to take a look at this: http://pivotal-field-engineering.github.io/pmr-common/pmr/apidocs/com/gopivotal/mapreduce/lib/input/JsonInputFormat.html This does not recognise nested structure, but I assume you might

Re: Any ways to connect BI tool to Spark without Hive

2016-07-07 Thread Chanh Le
Hi Mich, Thanks for replying. Currently we think we need to separate 2 groups of users. 1. Technical: can write SQL. 2. Business: can drag and drop fields or metrics and see the result. Our stack uses Zeppelin and Spark SQL to query data from Alluxio. Our data is currently stored in Parquet files.

Re: Any ways to connect BI tool to Spark without Hive

2016-07-07 Thread Mich Talebzadeh
Hi, I have not used Alluxio but it is a distributed file system, much like an IMDB, say Oracle TimesTen. Spark is your query tool and Zeppelin is the GUI interface to Spark, which basically lets you build graphs from Spark queries. You mentioned Hive, so I assume your persistent storage is Hive?

Any ways to connect BI tool to Spark without Hive

2016-07-07 Thread Chanh Le
Hi everyone, Currently we use Zeppelin for analytics on our data, and because it relies on SQL it is hard to roll out to general users. But users are using some kind of Oracle BI tool for analytics because it supports drag and drop and we can set permissions for each user. Our

Re: Compute pairwise distance

2016-07-07 Thread Manoj Awasthi
Hi Debasish, All, I see the status of SPARK-4823 [0] is "in-progress" still. I couldn't gather from the relevant pull request [1] if part of it is already in 1.6.0 (it's closed now). We are facing the same problem of computing pairwise distances between vectors where rows are > 5M and columns in

Re: stddev_samp() gives NaN

2016-07-07 Thread Mungeol Heo
The N is much bigger than 1 in my case. Here is an example that describes my issue. "select column1, stddev_samp(column2) from table1 group by column1" gives NaN. "select column1, cast(stddev_samp(column2) as decimal(16,3)) from table1 group by column1" gives numeric values, e.g. 234.234. "select

Re: Multiple aggregations over streaming dataframes

2016-07-07 Thread Andy Davidson
Kafka has an interesting model that might be applicable. You can think of Kafka as enabling a queue system. Writers are called producers, and readers are called consumers. The server is called a broker. A "topic" is like a named queue. Producers are independent. They can write to a "topic" at

Re: Multiple aggregations over streaming dataframes

2016-07-07 Thread Michael Armbrust
We are planning to address this issue in the future. At a high level, we'll have to add a delta mode so that updates can be communicated from one operator to the next. On Thu, Jul 7, 2016 at 8:59 AM, Arnaud Bailly wrote: > Indeed. But nested aggregation does not work

is dataframe.write() async? Streaming performance problem

2016-07-07 Thread Andy Davidson
I am running Spark 1.6.1 built for Hadoop 2.0.0-mr1-cdh4.2.0 and using the Kafka direct stream approach. I am running into performance problems. My processing time is greater than my window size. Changing window sizes, adding cores and executor memory does not change performance. I am having a lot of

spark read from http endpoint?

2016-07-07 Thread Robert Towne
Do any of the Spark SQL 1.x or 2.0 APIs allow reading from a REST endpoint serving JSON or XML? For example, in the 2.0 context I’ve tried the following attempts with varying errors: val conf = new SparkConf().setAppName("http test").setMaster("local[2]") val builder =
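
Spark SQL itself does not ship an HTTP/REST data source; one common workaround (a sketch with a hypothetical URL, assuming the endpoint returns one JSON record per line) is to fetch the payload on the driver and let Spark parse the text:

    import scala.io.Source

    // fetch on the driver (hypothetical endpoint), then parallelize the lines for Spark to parse
    val body = Source.fromURL("http://example.com/api/data.json").mkString
    val df = sqlContext.read.json(sc.parallelize(body.split("\n")))   // spark.read.json(...) in 2.0
    df.printSchema()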

Re: Presentation in London: Running Spark on Hive or Hive on Spark

2016-07-07 Thread Ashok Kumar
Thanks. Will this presentation be recorded as well? Regards On Wednesday, 6 July 2016, 22:38, Mich Talebzadeh wrote: Dear forum members I will be presenting on the topic of "Running Spark on Hive or Hive on Spark, your mileage varies" in Future of Data: London

Re: spark classloader question

2016-07-07 Thread Chen Song
Thanks Prajwal. I tried these options and they make no difference. On Thu, Jul 7, 2016 at 12:20 PM Prajwal Tuladhar wrote: > You can try to play with experimental flags [1] > `spark.executor.userClassPathFirst` > and `spark.driver.userClassPathFirst`. But this can also

Re: spark classloader question

2016-07-07 Thread Chen Song
Thanks, Marco. The code snippet has something like below: ClassLoader cl = Thread.currentThread().getContextClassLoader(); String packagePath = "com.xxx.xxx"; final Enumeration<URL> resources = cl.getResources(packagePath); So the resources collection is always empty, indicating no classes are loaded. As

Spark as sql engine on S3

2016-07-07 Thread Ashok Kumar
Hello gurus, We are storing data externally on Amazon S3 What is the optimum or best way to use Spark as SQL engine to access data on S3? Any info/write up will be greatly appreciated. Regards

RE: Processing json document

2016-07-07 Thread Yong Zhang
The problem is for the Hadoop InputFormat to identify the record delimiter. If the whole JSON record is on one line, then the natural record delimiter will be the newline character. Keep in mind that in a distributed file system, the file split position is most likely NOT on a record delimiter. The
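
To make that concrete, a minimal sketch of reading line-delimited JSON (the path is hypothetical; sqlContext.read.json becomes spark.read.json in 2.0):

    // each line is one complete JSON record, so Hadoop's newline-based splitting is safe
    val df = sqlContext.read.json("hdfs:///data/events.json")
    df.printSchema()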

Re: Processing json document

2016-07-07 Thread Lan Jiang
Hi there, Thank you all for your input. @Hyukjin, as a matter of fact, I had read the blog link you posted before asking the question on the forum. As you pointed out, the link uses wholeTextFiles(), which is bad in my case, because my JSON file can be as large as 20G+ and OOM might occur. I am

Re: spark classloader question

2016-07-07 Thread Prajwal Tuladhar
You can try to play with the experimental flags [1] `spark.executor.userClassPathFirst` and `spark.driver.userClassPathFirst`. But this can also potentially break other things (e.g., dependencies that Spark itself requires at initialization being overridden by the Spark app, and so on), so you will need to verify.

Re: spark classloader question

2016-07-07 Thread Marco Mistroni
Hi Chen, please post 1. the code snippet, 2. the exception. Any particular reason why you need to load classes in other jars programmatically? Have you tried building a fat jar with all the dependencies? hth marco On Thu, Jul 7, 2016 at 5:05 PM, Chen Song wrote: > Sorry to spam

Re: spark classloader question

2016-07-07 Thread Chen Song
Sorry to spam people who are not interested. Greatly appreciate it if anyone who is familiar with this can share some insights. On Wed, Jul 6, 2016 at 2:28 PM Chen Song wrote: > Hi > > I ran into problems to use class loader in Spark. In my code (run within > executor),

Re: Multiple aggregations over streaming dataframes

2016-07-07 Thread Arnaud Bailly
Indeed. But nested aggregation does not work with Structured Streaming; that's the point. I would like to know if there is a workaround, or what the plan is regarding this feature, which seems quite useful to me. If the implementation is not overly complex and it is just a matter of manpower, I am

Re: ClassNotFoundException: org.apache.parquet.hadoop.ParquetOutputCommitter

2016-07-07 Thread Bryan Cutler
Can you try running the example like this ./bin/run-example sql.RDDRelation I know there are some jars in the example folders, and running them this way adds them to the classpath On Jul 7, 2016 3:47 AM, "kevin" wrote: > hi,all: > I build spark use: > >

Re: StreamingKmeans Spark doesn't work at all

2016-07-07 Thread Biplob Biswas
Hi, Could anyone please look into this issue? I would really appreciate some assistance here. Thanks a lot. Thanks & Regards Biplob Biswas On Tue, Jul 5, 2016 at 1:00 PM, Biplob Biswas wrote: > > Hi, > > I implemented the streamingKmeans example provided in the

Spark 1.6.2 short circuit AND filter broken

2016-07-07 Thread Patrick Woody
Hey all, I hit a pretty nasty bug on 1.6.2 that I can't reproduce on 2.0. Here is the code/logical plan http://pastebin.com/ULnHd1b6. I have filterPushdown disabled, so when I call collect here it hits the Exception in my UDF before doing a null check on the input. I believe it is a symptom of

Re: Extend Dataframe API

2016-07-07 Thread tan shai
That is what I was thinking of doing. Do you have any idea about this, or any documentation? Many thanks. 2016-07-07 17:07 GMT+02:00 Koert Kuipers : > i dont see any easy way to extend the plans, beyond creating a custom > version of spark. > > On Thu, Jul 7, 2016 at 9:31 AM,

Re: Question regarding structured data and partitions

2016-07-07 Thread tan shai
Thank you for your answer. Since Spark 1.6.0, it has been possible to partition a dataframe using hash partitioning with repartition: https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrame I have also sorted a dataframe, and it uses range partitioning in the

Re: Extend Dataframe API

2016-07-07 Thread Koert Kuipers
I don't see any easy way to extend the plans, beyond creating a custom version of Spark. On Thu, Jul 7, 2016 at 9:31 AM, tan shai wrote: > Hi, > > I need to add new operations to the dataframe API. > Can any one explain to me how to extend the plans of query execution? > >

Re: Question regarding structured data and partitions

2016-07-07 Thread Koert Kuipers
Since dataframes represent more or less a plan of execution, they do not have partition information as such, I think. You could, however, do dataFrame.rdd to force it to create a physical plan that results in an actual RDD, and then query the RDD for partition info. On Thu, Jul 7, 2016 at 4:24 AM,
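
A small sketch of that suggestion:

    val rdd = dataFrame.rdd            // forces a physical plan and yields an RDD[Row]
    println(rdd.partitions.length)     // number of partitions
    println(rdd.partitioner)           // Some(partitioner) if one is defined, otherwise None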

Re: One map per folder in spark or Hadoop

2016-07-07 Thread Deepak Sharma
You have to distribute the files in some distributed file system like HDFS, or else copy the files to every executor's local file system and make sure to mention the file scheme in the URI explicitly. Thanks Deepak On Thu, Jul 7, 2016 at 7:13 PM, Balachandar R.A. wrote:
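
One distributed-cache-style way to ship such files to the executors (a sketch; the paths and file names are hypothetical):

    import org.apache.spark.SparkFiles

    sc.addFile("hdfs:///shared/lookup.dat")          // copied to every executor's work directory
    rdd.map { record =>
      val localPath = SparkFiles.get("lookup.dat")   // absolute local path on the executor
      // run the external executable / read the side file from localPath here
      record
    }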

Re: One map per folder in spark or Hadoop

2016-07-07 Thread Balachandar R.A.
Hi, Thanks for the code snippet. Is it possible for the executable inside the map process to access directories and files present in the local file system? I know the tasks run on the slave nodes in a temporary working directory, and I can think about the distributed cache. But I would still like to

Extend Dataframe API

2016-07-07 Thread tan shai
Hi, I need to add new operations to the dataframe API. Can anyone explain to me how to extend the plans of query execution? Many thanks.

Re: Spark with HBase Error - Py4JJavaError

2016-07-07 Thread ram kumar
Hi Puneet, Have you tried appending --jars $SPARK_HOME/lib/spark-examples-*.jar to the execution command? Ram On Thu, Jul 7, 2016 at 5:19 PM, Puneet Tripathi < puneet.tripa...@dunnhumby.com> wrote: > Guys, Please can anyone help on the issue below? > > > > Puneet > > > > *From:* Puneet

categoricalFeaturesInfo

2016-07-07 Thread pseudo oduesp
Hi, how can I use this option in Random Forest? When I transform my vector (100 features) I have 20 categorical features included. If I understand categoricalFeaturesInfo, I should pass the positions of my 20 categorical features inside the vector of 100, with a map { position of feature
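
For reference, a minimal sketch of how that map is passed to the RDD-based MLlib API (the indices, arities and other parameters below are made up for illustration; trainingData is an assumed RDD[LabeledPoint]):

    import org.apache.spark.mllib.tree.RandomForest

    // feature index in the vector -> number of categories for that feature
    val categoricalFeaturesInfo = Map(3 -> 4, 7 -> 2, 15 -> 10)
    val model = RandomForest.trainClassifier(trainingData, 2, categoricalFeaturesInfo,
      100, "auto", "gini", 5, 32)   // numTrees, featureSubsetStrategy, impurity, maxDepth, maxBins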

Re: Re: how to select first 50 value of each group after group by?

2016-07-07 Thread Mich Talebzadeh
Hi, Your statement below: sortedDF.registerTempTable("sortedDF") val top50 = hc.sql("select id,location from sortedDF where rowNumber<=50") is on a Hive table. Try sortedDF.registerTempTable("sortedDF") val top50 = hc.sql("select id,location from sortedDF LIMIT 50") rowNumber

Re: Re: how to select first 50 value of each group after group by?

2016-07-07 Thread Anton Okolnychyi
Hi, I can try to guess what is wrong, but I might be incorrect. You should be careful with window frames (you define them via the rowsBetween() method). In my understanding, all window functions can be divided into 2 groups: - functions defined by the
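
For the original top-50-per-group question, a sketch along those lines that avoids any frame specification (column names are taken from the earlier messages; the ordering column is an assumption, and hc is the HiveContext used in the thread):

    sortedDF.registerTempTable("sortedDF")
    val top50 = hc.sql("""
      select id, location from (
        select id, location,
               row_number() over (partition by id order by location) as rn
        from sortedDF
      ) t
      where rn <= 50""")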

Re: RDD and Dataframes

2016-07-07 Thread Bruno Costa
Thank you for the answer. One of the optimizations of Dataframes/Datasets (beyond Catalyst) is Encoders (Project Tungsten), which translate domain objects into Spark's internal (binary) format. By using encoders, the data is not managed by the Java Virtual Machine anymore (which increase

Re: problem extracting map from json

2016-07-07 Thread Sivakumaran S
Hi Michal, Will an example help? import scala.util.parsing.json._ // Requires scala-parser-combinators because it is no longer part of core Scala val wbJSON = JSON.parseFull(weatherBox) // wbJSON is a parsed JSON value now // Depending on the structure, now traverse through the object val

problem extracting map from json

2016-07-07 Thread Michal Vince
Hi guys, I'm trying to extract a Map[String, Any] from a JSON string. This works well in any Scala REPL I tried, both Scala 2.11 and 2.10, and using both the json4s and liftweb-json libraries, but if I try to do the same thing in spark-shell I always get "No information known about type..."

Re: Multiple aggregations over streaming dataframes

2016-07-07 Thread Sivakumaran S
Arnauld, You could aggregate the first table and then merge it with the second table (assuming that they are similarly structured) and then carry out the second aggregation. Unless the data is very large, I don’t see why you should persist it to disk. IMO, nested aggregation is more elegant

Re: RDD and Dataframes

2016-07-07 Thread Rishi Mishra
Yes, finally it will be converted to an RDD internally. However, DataFrame queries are passed through Catalyst, which provides several optimizations, e.g. code generation, intelligent shuffle, etc., which is not the case for pure RDDs. Regards, Rishitesh Mishra, SnappyData.

Re: Multiple aggregations over streaming dataframes

2016-07-07 Thread Arnaud Bailly
It's aggregation at multiple levels in a query: first do some aggregation on one table, then join with another table and do a second aggregation. I could probably rewrite the query in such a way that it does the aggregation in one pass, but that would obfuscate the purpose of the various stages. Le 7

RE: Spark with HBase Error - Py4JJavaError

2016-07-07 Thread Puneet Tripathi
Guys, Please can anyone help on the issue below? Puneet From: Puneet Tripathi [mailto:puneet.tripa...@dunnhumby.com] Sent: Thursday, July 07, 2016 12:42 PM To: user@spark.apache.org Subject: Spark with HBase Error - Py4JJavaError Hi, We are running Hbase in fully distributed mode. I tried to

Re: Optimize filter operations with sorted data

2016-07-07 Thread tan shai
How can you verify that it is loading only the partitions matching the time and network filter? 2016-07-07 11:58 GMT+02:00 Chanh Le : > Hi Tan, > It depends on how data organise and what your filter is. > For example in my case: I store data by partition by field time and > network_id.

Re: Optimize filter operations with sorted data

2016-07-07 Thread tan shai
Yes it is operating on the sorted column 2016-07-07 11:43 GMT+02:00 Ted Yu : > Does the filter under consideration operate on sorted column(s) ? > > Cheers > > > On Jul 7, 2016, at 2:25 AM, tan shai wrote: > > > > Hi, > > > > I have a sorted

Reply: Re: how to select first 50 value of each group after group by?

2016-07-07 Thread luohui20001
Hi Anton, I checked the docs you mentioned and coded accordingly; however, I met an exception like "org.apache.spark.sql.AnalysisException: Window function row_number does not take a frame specification.;" It seems that the row_number API is giving a global row number for every row

RDD and Dataframes

2016-07-07 Thread brccosta
Dear guys, I'm investigating the differences between RDDs and Dataframes/Datasets. I couldn't find the answer to this question: do Dataframes act as a new layer in the Spark stack? I mean, during execution is there a conversion to RDDs? For example, if I create a Dataframe and perform a query, in

Re: Graphframe Error

2016-07-07 Thread Arun Patel
I have tried this already. It does not work. What version of Python is needed for this package? On Wed, Jul 6, 2016 at 12:45 AM, Felix Cheung wrote: > This could be the workaround: > > http://stackoverflow.com/a/36419857 > > > > > On Tue, Jul 5, 2016 at 5:37 AM

Re: Is that possible to launch spark streaming application on yarn with only one machine?

2016-07-07 Thread Rabin Banerjee
In that case, I suspect that MQTT is not getting data when you submit in yarn-cluster mode. Can you please try dumping the data to a text file instead of printing it when submitting in yarn-cluster mode? On Jul 7, 2016 12:46 PM, "Yu Wei" wrote: > Yes. Thanks for your

Re: Multiple aggregations over streaming dataframes

2016-07-07 Thread Sivakumaran S
Hi Arnauld, Sorry for the doubt, but what exactly is multiple aggregation? What is the use case? Regards, Sivakumaran > On 07-Jul-2016, at 11:18 AM, Arnaud Bailly wrote: > > Hello, > > I understand multiple aggregations over streaming dataframes is not currently >

Re: Spark streaming Kafka Direct API + Multiple consumers

2016-07-07 Thread Rabin Banerjee
It's not required. Simplified parallelism: no need to create multiple input Kafka streams and union them. With directStream, Spark Streaming will create as many RDD partitions as there are Kafka partitions to consume, and they will all read data from Kafka in parallel. So there is a one-to-one
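
A minimal direct-stream sketch for Spark 1.6 (the broker list, topic name and the StreamingContext ssc are placeholders/assumptions):

    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.KafkaUtils

    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")
    val topics = Set("events")
    // one RDD partition per Kafka partition of "events"
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics)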

ClassNotFoundException: org.apache.parquet.hadoop.ParquetOutputCommitter

2016-07-07 Thread kevin
Hi all, I built Spark using: ./make-distribution.sh --name "hadoop2.7.1" --tgz "-Pyarn,hadoop-2.6,parquet-provided,hive,hive-thriftserver" -DskipTests -Dhadoop.version=2.7.1 I can run the example: ./bin/spark-submit --class org.apache.spark.examples.SparkPi \ --master spark://master1:7077 \

Multiple aggregations over streaming dataframes

2016-07-07 Thread Arnaud Bailly
Hello, I understand multiple aggregations over streaming dataframes are not currently supported in Spark 2.0. Is there a workaround? Off the top of my head I could think of a two-stage approach: - first query writes output to disk/memory using "complete" mode - second query reads from
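
A rough sketch of that two-stage idea on the Spark 2.0 structured streaming API (input, the key column and the query name are hypothetical; the memory sink is only suitable for small results/testing):

    import org.apache.spark.sql.functions.sum

    // stage 1: streaming aggregation, materialised as an in-memory table in "complete" mode
    val stage1 = input.groupBy("key").count()
    val q1 = stage1.writeStream
      .outputMode("complete")
      .format("memory")
      .queryName("stage1")
      .start()

    // stage 2: a separate query reads the materialised intermediate result
    val stage2 = spark.table("stage1").agg(sum("count"))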

Re: Spark streaming. Strict discretizing by time

2016-07-07 Thread rss rss
Hi, I changed auto.offset.reset to largest. The result is 30, 50, 40, 40, 35, 30 seconds... instead of 10 seconds. It looks like an attempt to react to backpressure, but a very slow one. In any case it is far from any real-time tasks, including soft real time. And OK, I agree with Spark usage with data

Re: Optimize filter operations with sorted data

2016-07-07 Thread Chanh Le
Hi Tan, It depends on how the data is organised and what your filter is. For example, in my case I store data partitioned by the fields time and network_id. If I filter by time or network_id (or both) together with other fields, Spark only loads the partitions of time and network_id matching the filter and then filters the rest. > On Jul 7,
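
A sketch of that layout and the kind of filter that benefits from it (paths and values are placeholders):

    // write: one directory per (time, network_id) combination
    df.write.partitionBy("time", "network_id").parquet("/warehouse/events")

    // read: a filter on the partition columns only touches the matching directories
    import org.apache.spark.sql.functions.col
    val hits = sqlContext.read.parquet("/warehouse/events")
      .filter(col("time") === "2016-07-07" && col("network_id") === 42)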

Re: Optimize filter operations with sorted data

2016-07-07 Thread Ted Yu
Does the filter under consideration operate on sorted column(s) ? Cheers > On Jul 7, 2016, at 2:25 AM, tan shai wrote: > > Hi, > > I have a sorted dataframe, I need to optimize the filter operations. > How does Spark performs filter operations on sorted dataframe? >

Spark streaming Kafka Direct API + Multiple consumers

2016-07-07 Thread SamyaMaiti
Hi Team, Is there a way we can consume from Kafka using the Spark Streaming direct API with multiple consumers (belonging to the same consumer group)? Regards, Sam -- View this message in context:

Optimize filter operations with sorted data

2016-07-07 Thread tan shai
Hi, I have a sorted dataframe and I need to optimize the filter operations. How does Spark perform filter operations on a sorted dataframe? Is it scanning all the data? Many thanks.

Re: stddev_samp() gives NaN

2016-07-07 Thread Sean Owen
Sample standard deviation can't be defined in the case of N=1, because it has N-1 in the denominator. My guess is that this is the case you're seeing. A population of N=1 still has a standard deviation of course (which is 0). On Thu, Jul 7, 2016 at 9:51 AM, Mungeol Heo
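A quick toy example that shows the effect (made-up data, Spark 1.6+, run in spark-shell):

    import sqlContext.implicits._
    import org.apache.spark.sql.functions.{stddev_samp, stddev_pop}

    val df = Seq(("a", 1.0), ("b", 2.0), ("b", 4.0)).toDF("column1", "column2")
    df.groupBy("column1")
      .agg(stddev_samp($"column2"), stddev_pop($"column2"))
      .show()
    // the group "a" has a single row: stddev_samp is NaN while stddev_pop is 0.0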

Re: stddev_samp() gives NaN

2016-07-07 Thread Sean Owen
The OP is not calling stddev though, so I still don't see that this is the question at hand. But while we're off on the topic -- while I certainly agree that stddev is mapped to the sample standard deviation in DBs, it doesn't actually make much sense as a default. What you get back is not the

Re: stddev_samp() gives NaN

2016-07-07 Thread Mungeol Heo
I know stddev_samp and stddev_pop gives different values, because they have different definition. What I want to know is why stddev_samp gives "NaN", and not a numeric value. On Thu, Jul 7, 2016 at 5:39 PM, Sean Owen wrote: > I don't think that's relevant here. The question

Re: stddev_samp() gives NaN

2016-07-07 Thread Mich Talebzadeh
stddev is mapped to stddev_samp. That is the general use case, or rather the common use, of standard deviation. Dr Mich Talebzadeh LinkedIn https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

Re: stddev_samp() gives NaN

2016-07-07 Thread Sean Owen
I don't think that's relevant here. The question is why would samp give a different result to pop, not the result of "stddev". Neither one is a 'correct' definition of standard deviation in the abstract; one or the other is correct depending on what standard deviation you are trying to measure.

Re: stddev_samp() gives NaN

2016-07-07 Thread Mich Talebzadeh
The correct STDDEV function used is STDDEV_SAMP, not STDDEV_POP. That is the correct one; you can actually work that one out yourself. BTW, Hive also gives a wrong value. This is what I reported back in April about Hive giving an incorrect value. Both Oracle and Sybase point STDDEV to STDDEV_SAMP

Re: stddev_samp() gives NaN

2016-07-07 Thread Sean Owen
No, because these are different values defined differently. If you have 1 data point, the sample stdev is undefined while population stdev is defined. Refer to their definition. On Thu, Jul 7, 2016 at 9:23 AM, Mungeol Heo wrote: > Hello, > > As I mentioned at the title,

Re: Question regarding structured data and partitions

2016-07-07 Thread tan shai
Using partitioning with dataframes, how can we retrieve information about partitions, for example partition bounds? Thanks, Shaira 2016-07-07 6:30 GMT+02:00 Koert Kuipers : > spark does keep some information on the partitions of an RDD, namely the > partitioning/partitioner.

stddev_samp() gives NaN

2016-07-07 Thread Mungeol Heo
Hello, As I mentioned in the title, the stddev_samp function gives NaN while stddev_pop gives a numeric value on the same data. The stddev_samp function will give a numeric value if I cast it to decimal, e.g. cast(stddev_samp(column_name) as decimal(16,3)). Is it a bug? Thanks - mungeol

Re: Structured Streaming Comparison to AMPS

2016-07-07 Thread Arnaud Bailly
Hi, I am also interested in having JOIN support streams on both sides. I understand Spark SQL's JOIN currently supports having a stream on a single side only. Could you please provide some more details on why this is the case? What are the technical limitations that make it harder to implement

Re: Structured Streaming Comparison to AMPS

2016-07-07 Thread Tathagata Das
We will look into streaming-streaming joins in a future release of Spark, though no promises on any timeline yet. We are currently fighting to get Spark 2.0 out of the door. There isn't a JIRA for this right now. However, you can track the Structured Streaming epic JIRA to see what's going on. I try

Re: Is that possible to launch spark streaming application on yarn with only one machine?

2016-07-07 Thread Yu Wei
Yes. Thanks for your clarification. The problem I encountered is that in yarn-cluster mode there is no output from "DStream.print()" in the YARN logs. In the Spark implementation, org/apache/spark/streaming/dstream/DStream.scala, the logs related to "Time" were printed out. However, other information for

Spark with HBase Error - Py4JJavaError

2016-07-07 Thread Puneet Tripathi
Hi, We are running HBase in fully distributed mode. I tried to connect to HBase via pyspark and then write to HBase using saveAsNewAPIHadoopDataset, but it failed; the error says: Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.saveAsHadoopDataset. :

SPARK-8813 - combining small files in spark sql

2016-07-07 Thread Ajay Srivastava
Hi, This JIRA, https://issues.apache.org/jira/browse/SPARK-8813, is fixed in Spark 2.0, but the resolution is not mentioned there. In our use case, there are big as well as many small Parquet files which are being queried using Spark SQL. Can someone please explain what the fix is and how I can use it

Re: Processing json document

2016-07-07 Thread Hyukjin Kwon
The link uses the wholeTextFiles() API, which treats each file as a single record. 2016-07-07 15:42 GMT+09:00 Jörn Franke : > This does not need necessarily the case if you look at the Hadoop > FileInputFormat architecture then you can even split large multi line Jsons > without

Re: Processing json document

2016-07-07 Thread Jörn Franke
This does not necessarily need to be the case. If you look at the Hadoop FileInputFormat architecture, you can even split large multi-line JSONs without issues. I would need to have a look at it, but one large file does not mean one executor, independent of the underlying format. > On 07 Jul 2016,

SparkSQL Added file get Exception: is a directory and recursive is not turned on

2016-07-07 Thread linxi zeng
Hi all, As recorded in https://issues.apache.org/jira/browse/SPARK-16408, when using Spark SQL to execute SQL like: add file hdfs://xxx/user/test; if the HDFS path (hdfs://xxx/user/test) is a directory, then we will get an exception like: org.apache.spark.SparkException: Added file

Re: MLLib SVMWithSGD is failing for large dataset

2016-07-07 Thread Chitturi Padma
Hi Sarath, By any chance have you resolved this issue ? Thanks, Padma CH On Tue, Apr 28, 2015 at 11:20 PM, sarath [via Apache Spark User List] < ml-node+s1001560n22694...@n3.nabble.com> wrote: > > I am trying to train a large dataset consisting of 8 million data points > and 20 million

Re: Processing json document

2016-07-07 Thread Hyukjin Kwon
There is a good link for this here: http://searchdatascience.com/spark-adventures-1-processing-multi-line-json-files If there are a lot of small files, then it would work pretty well in a distributed manner, but I am worried if it is a single large file. In that case, this would only work in

Re: Processing json document

2016-07-07 Thread Jean Georges Perrin
Do you want id1, id2, id3 to be processed similarly? The Java code I use is: df = df.withColumn(K.NAME, df.col("fields.premise_name")); The original structure is something like {"fields":{"premise_name":"ccc"}}. Hope it helps. > On Jul 7, 2016, at 1:48 AM, Lan Jiang