Re: Re: how to extract arraytype data to file

2016-10-18 Thread lk_spark
Thank you, all of you. explode() is helpful: df.selectExpr("explode(bizs) as e").select("e.*").show() 2016-10-19 lk_spark From: Hyukjin Kwon Sent: 2016-10-19 13:16 Subject: Re: how to extract arraytype data to file To: "Divya Gehlot"
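A minimal sketch of the full flow under discussion, assuming a line-delimited JSON input whose records carry an array-of-struct column named bizs (as in the snippet above); the paths are placeholders:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.explode

    val spark = SparkSession.builder().appName("explode-example").getOrCreate()

    // One JSON object per input line; bizs is an array of structs.
    val df = spark.read.json("/path/to/input.json")

    // Each array element becomes its own row; "e.*" expands the struct fields.
    val flattened = df.select(explode(df("bizs")).as("e")).select("e.*")

    flattened.show()
    flattened.write.json("/path/to/output")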

Re: Deep learning libraries for scala

2016-10-18 Thread Edward Fine
How about https://deeplearning4j.org/ ? On Wed, Oct 5, 2016 at 9:25 AM janardhan shetty wrote: > Any help from the experts regarding this is appreciated > On Oct 3, 2016 1:45 PM, "janardhan shetty" wrote: > > Thanks Ben. The current spark ML

Re: how to extract arraytype data to file

2016-10-18 Thread Hyukjin Kwon
This reminds me of https://github.com/databricks/spark-xml/issues/141#issuecomment-234835577 Maybe using explode() would be helpful. Thanks! 2016-10-19 14:05 GMT+09:00 Divya Gehlot : > http://stackoverflow.com/questions/33864389/how-can-i- >

Re: how to extract arraytype data to file

2016-10-18 Thread Divya Gehlot
http://stackoverflow.com/questions/33864389/how-can-i-create-a-spark-dataframe-from-a-nested-array-of-struct-element Hope this helps Thanks, Divya On 19 October 2016 at 11:35, lk_spark wrote: > hi,all: > I want to read a json file and search it by sql . > the data struct

RE: how to extract arraytype data to file

2016-10-18 Thread Kappaganthu, Sivaram (ES)
There is an option called explode() for this. From: lk_spark [mailto:lk_sp...@163.com] Sent: Wednesday, October 19, 2016 9:06 AM To: user.spark Subject: how to extract arraytype data to file hi, all: I want to read a json file and search it by sql. The data struct should be: bid: string

how to extract arraytype data to file

2016-10-18 Thread lk_spark
hi, all: I want to read a json file and search it by sql. The data struct should be: bid: string (nullable = true) code: string (nullable = true) and the json file data should be like: {"bid":"MzI4MTI5MzcyNw==","code":"罗甸网警"} {"bid":"MzI3MzQ5Nzc2Nw==","code":"西早君"} but in fact my json

hive.exec.stagingdir not effect in spark2.0.1

2016-10-18 Thread 谭 成灶
Hi, I have set the property "hive.exec.stagingdir" to the HDFS dir "/tmp/spark_log/${user.name}/.hive-staging" in hive-site.xml, but it has no effect in Spark 2.0.1: the staging directory is still created inside the table location. It works in Spark 1.6, which creates the hive-staging files in the configured HDFS dir

How does Spark determine in-memory partition count when reading Parquet ~files?

2016-10-18 Thread shea.parkes
When reading a parquet ~file with >50 parts, Spark is giving me a DataFrame object with far fewer in-memory partitions. I'm happy to troubleshoot this further, but I don't know Scala well and could use some help pointing me in the right direction. Where should I be looking in the code base to
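For what it's worth, in Spark 2.x the file-based Parquet reader packs many part-files into fewer input splits, governed mainly by spark.sql.files.maxPartitionBytes (128 MB by default) and spark.sql.files.openCostInBytes; the split-packing logic lives around the file-source scan planning (e.g. FileSourceStrategy). A hedged sketch for inspecting and influencing the partition count, with a placeholder path:

    import org.apache.spark.sql.SparkSession

    // Smaller target split size -> more in-memory partitions (default is 128 MB).
    val spark = SparkSession.builder()
      .appName("parquet-partitions")
      .config("spark.sql.files.maxPartitionBytes", (32 * 1024 * 1024).toString)
      .getOrCreate()

    val df = spark.read.parquet("/path/to/parquet-dir")
    println(s"partitions after read: ${df.rdd.getNumPartitions}")

    // Or force an explicit partition count after the read.
    println(s"after repartition: ${df.repartition(200).rdd.getNumPartitions}")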

Re: Broadcasting Non Serializable Objects

2016-10-18 Thread Daniel Imberman
Hi Pedro, Can you please post your code? Daniel On Tue, Oct 18, 2016 at 12:27 PM pedroT wrote: > Hi guys. > > I know this is a well known topic, but reading about it (a lot) I'm not sure > about the answer.. > > I need to broadcast a complex structure with a lot of objects

Re: Why the json file used by sparkSession.read.json must be a valid json object per line

2016-10-18 Thread Hyukjin Kwon
Regarding his recent PR [1], I guess he meant multi-line JSON. As far as I know, single-line JSON also complies with the standard. I left a comment with the RFC in the PR, but please let me know if I am wrong at any point. Thanks! [1] https://github.com/apache/spark/pull/15511 On 19 Oct 2016 7:00 a.m.,

Re: spark with kerberos

2016-10-18 Thread Michael Segel
(Sorry, sent reply via wrong account.) Steve, Kinda hijacking the thread, but I promise it's still on topic to the OP's issue.. ;-) Usually you will end up having a local Kerberos set up per cluster. So your machine accounts (hive, yarn, hbase, etc.) are going to be local to the cluster. So you

Re: Does the delegator map task of SparkLauncher need to stay alive until Spark job finishes ?

2016-10-18 Thread Marcelo Vanzin
On Tue, Oct 18, 2016 at 3:01 PM, Elkhan Dadashov wrote: > Does my map task need to wait until Spark job finishes ? No... > Or is there any way, my map task finishes after launching Spark job, and I > can still query and get status of Spark job outside of map task (or

Does the delegator map task of SparkLauncher need to stay alive until Spark job finishes ?

2016-10-18 Thread Elkhan Dadashov
Hi, Does the delegator map task of SparkLauncher need to stay alive until Spark job finishes ? 1) Currently, I have mapper tasks, which launches Spark job via SparkLauncer#startApplication() Does my map task need to wait until Spark job finishes ? Or is there any way, my map task finishes
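For reference, a hedged sketch of the SparkLauncher#startApplication flow being asked about: the launching task gets back a SparkAppHandle and can either return immediately or poll it, rather than blocking until the job finishes. The jar path, main class and master below are placeholders:

    import org.apache.spark.launcher.{SparkAppHandle, SparkLauncher}

    val handle = new SparkLauncher()
      .setAppResource("/path/to/my-spark-job.jar")   // placeholder
      .setMainClass("com.example.MySparkJob")        // placeholder
      .setMaster("yarn")
      .startApplication(new SparkAppHandle.Listener {
        override def stateChanged(h: SparkAppHandle): Unit =
          println(s"state: ${h.getState}, appId: ${h.getAppId}")
        override def infoChanged(h: SparkAppHandle): Unit = ()
      })

    // The caller does not have to block; it can also poll the handle later.
    while (!handle.getState.isFinal) Thread.sleep(5000)
    println(s"final state: ${handle.getState}")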

Re: Why the json file used by sparkSession.read.json must be a valid json object per line

2016-10-18 Thread Daniel Barclay
Koert, Koert Kuipers wrote: A single json object would mean for most parsers it needs to fit in memory when reading or writing. Note that codlife didn't seem to be asking about /single-object/ JSON files, but about /standard-format/ JSON files. On Oct 15, 2016 11:09, "codlife"

Re: Aggregate UDF (UDAF) in Python

2016-10-18 Thread ayan guha
Is it possible to use the aggregate functions available at the RDD level to do similar stuff? On 19 Oct 2016 04:41, "Tobi Bosede" wrote: > Thanks Assaf. > > This is a lot more complicated than I was expecting...might end up using > collect if data fits in memory. I was also
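A rough sketch of what that RDD-level alternative could look like, here a per-key average via aggregateByKey over made-up (key, value) pairs; whether it fits depends on the actual aggregation being replaced:

    // Assumes an existing SparkContext `sc`; the data is toy data.
    val pairs = sc.parallelize(Seq(("a", 1.0), ("a", 3.0), ("b", 2.0)))

    // Accumulator is (sum, count): seqOp folds one value in, combOp merges partitions.
    val sumCount = pairs.aggregateByKey((0.0, 0L))(
      (acc, v) => (acc._1 + v, acc._2 + 1L),
      (x, y) => (x._1 + y._1, x._2 + y._2)
    )

    val averages = sumCount.mapValues { case (sum, count) => sum / count }
    averages.collect().foreach(println)   // (a,2.0), (b,2.0)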

Re: jdbcRDD for data ingestion from RDBMS

2016-10-18 Thread Mich Talebzadeh
Hi, If we are talking about billions of records, it depends on your network and the RDBMS's support for parallel connections. From my experience it works OK for dimension tables of moderate size, in that you can open parallel connections to the RDBMS (assuming the RDBMS has a primary key/unique column) to
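As a hedged illustration of those parallel connections, the DataFrame JDBC reader can split a table on a numeric key column into concurrent range scans; URL, table, column and bounds below are placeholders:

    import java.util.Properties

    // Assumes an existing SparkSession `spark`. Credentials and URL are placeholders.
    val props = new Properties()
    props.setProperty("user", "myuser")
    props.setProperty("password", "mypassword")

    val dim = spark.read.jdbc(
      "jdbc:oracle:thin:@//dbhost:1521/ORCL",  // JDBC URL
      "SALES.DIM_CUSTOMER",                    // table
      "CUSTOMER_ID",                           // numeric primary key / unique column
      1L, 10000000L,                           // lower/upper bound of the key range
      16,                                      // number of parallel connections
      props
    )

    dim.write.parquet("/data/dim_customer")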

spark streaming client program needs to be restarted after few hours of idle time. how can I fix it?

2016-10-18 Thread kant kodali
Hi Guys, My Spark Streaming client program works fine as long as the receiver receives data, but say my receiver has no more data to receive for a few hours (4-5 hours) and then it starts receiving data again; at that point the Spark client program doesn't seem to process any data. It

Re: About Error while reading large JSON file in Spark

2016-10-18 Thread Steve Loughran
On 18 Oct 2016, at 08:43, Chetan Khatri > wrote: Hello Community members, I am getting an error while reading a large JSON file in Spark; the underlying read code can't handle more than 2^31 bytes in a single line: if (bytesConsumed >

Re: Making more features in Logistic Regression

2016-10-18 Thread Nick Pentreath
You can use the PolynomialExpansion in Spark ML ( http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.feature.PolynomialExpansion ) On Tue, 18 Oct 2016 at 21:47 miro wrote: > Yes, I was thinking going down this road: > > >
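A hedged sketch of PolynomialExpansion on a toy 2-feature dataset (the data and column names are made up); it produces the degree-2 power and interaction terms that can then feed LogisticRegression:

    import org.apache.spark.ml.feature.PolynomialExpansion
    import org.apache.spark.ml.linalg.Vectors

    // Assumes an existing SparkSession `spark`.
    val data = Seq(Vectors.dense(2.0, 1.0), Vectors.dense(0.0, 0.0), Vectors.dense(3.0, -1.0))
    val df = spark.createDataFrame(data.map(Tuple1.apply)).toDF("features")

    val poly = new PolynomialExpansion()
      .setInputCol("features")
      .setOutputCol("polyFeatures")
      .setDegree(2)

    poly.transform(df).show(false)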

Re: spark with kerberos

2016-10-18 Thread Steve Loughran
On 17 Oct 2016, at 22:11, Michael Segel > wrote: @Steve you are going to have to explain what you mean by ‘turn Kerberos on’. Taken one way… it could mean making cluster B secure and running Kerberos and then you’d have to create

Re: Making more features in Logistic Regression

2016-10-18 Thread miro
Yes, I was thinking going down this road: http://scikit-learn.org/stable/modules/linear_model.html#polynomial-regression-extending-linear-models-with-basis-functions

Broadcasting Non Serializable Objects

2016-10-18 Thread pedroT
Hi guys. I know this is a well known topic, but reading about it (a lot) I'm not sure about the answer.. I need to broadcast a complex structure with a lot of objects as fields, including some from external libraries which I can't easily turn into serializable ones. I tried making a static method
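One common workaround (a sketch, not necessarily what Pedro needs): broadcast only the serializable configuration and rebuild the non-serializable object lazily, once per executor JVM, inside a singleton holder. HeavyClient below is a made-up stand-in for a third-party class that cannot be made Serializable:

    // Stand-in for a non-serializable third-party class.
    class HeavyClient(config: Map[String, String]) {
      def process(record: String): String = record.toUpperCase   // placeholder logic
    }

    // Built at most once per executor JVM; never shipped through serialization.
    object ClientHolder {
      private var client: HeavyClient = _
      def get(config: Map[String, String]): HeavyClient = synchronized {
        if (client == null) client = new HeavyClient(config)
        client
      }
    }

    // Assumes an existing SparkContext `sc`.
    val configBc = sc.broadcast(Map("endpoint" -> "http://example.com"))  // serializable config only
    val rdd = sc.parallelize(Seq("a", "b", "c"))

    val result = rdd.mapPartitions { iter =>
      val client = ClientHolder.get(configBc.value)   // reused within the executor
      iter.map(client.process)
    }
    result.collect().foreach(println)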

Re: Making more features in Logistic Regression

2016-10-18 Thread aditya1702

Re: Making more features in Logistic Regression

2016-10-18 Thread aditya1702
Here is the graph and the features with their corresponding data.

Re: Making more features in Logistic Regression

2016-10-18 Thread miro
Hi, I think it depends on how non-linear your data is. You could add polynomial features to your model, but everything depends on your data. If you could share more details, maybe a scatter plot, it would help to investigate the problem further. All the best, Miro > On 18 Oct 2016, at 19:09,

Re: Aggregate UDF (UDAF) in Python

2016-10-18 Thread Tobi Bosede
Thanks Assaf. This is a lot more complicated than I was expecting...might end up using collect if data fits in memory. I was also thinking about using the pivot function in pandas, but that wouldn't work in parallel and so would be even more inefficient than collect. On Tue, Oct 18, 2016 at 7:24

How to add all jars in a folder to executor classpath?

2016-10-18 Thread nitinkak001
I need to add all the jars in hive/lib to my spark job executor classpath. I tried this spark.executor.extraLibraryPath=/opt/cloudera/parcels/CDH-5.8.2-1.cdh5.8.2.p0.3/lib/hive/lib and spark.executor.extraClassPath=/opt/cloudera/parcels/CDH-5.8.2-1.cdh5.8.2.p0.3/lib/hive/lib/* but it does not

Making more features in Logistic Regression

2016-10-18 Thread aditya1702
Hello, I am trying to solve a problem of Logistic Regression using Spark. I am still a newbie to machine learning. I wanted to ask: if I have 2 features for logistic regression and the features are non-linear (regularized logistic regression), do we have to make more features by considering

Re: Spark Streaming 2 Kafka 0.10 Integration for Aggregating Data

2016-10-18 Thread Sean Owen
Try adding the spark-streaming_2.11 artifact as a dependency too. You will be directly depending on it. On Tue, Oct 18, 2016 at 2:16 PM Furkan KAMACI wrote: > Hi, > > I have a search application and want to monitor queries per second for it. > I have Kafka at my backend
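If the build is sbt-based, the dependency block might look roughly like this (versions are examples; match them to the cluster):

    // build.sbt sketch: depend directly on spark-streaming alongside the Kafka 0.10 connector.
    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core"                 % "2.0.1" % "provided",
      "org.apache.spark" %% "spark-streaming"            % "2.0.1" % "provided",
      "org.apache.spark" %% "spark-streaming-kafka-0-10" % "2.0.1"
    )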

Spark Streaming 2 Kafka 0.10 Integration for Aggregating Data

2016-10-18 Thread Furkan KAMACI
Hi, I have a search application and want to monitor queries per second for it. I have Kafka at my backend which acts like a bus for messages. Whenever a search request is done I publish the nano time of the current system. I want to use Spark Streaming to aggregate such data but I am so new to
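A hedged sketch of what the aggregation could look like with the Kafka 0.10 direct stream: counting events per 1-second batch approximates queries per second. Broker list, group id and topic name are placeholders:

    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka010._

    val conf = new SparkConf().setAppName("qps-monitor")
    val ssc = new StreamingContext(conf, Seconds(1))

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "broker1:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "qps-monitor",
      "auto.offset.reset" -> "latest"
    )

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe[String, String](Seq("search-events"), kafkaParams)
    )

    // One count per batch interval, i.e. roughly queries per second.
    stream.count().print()

    ssc.start()
    ssc.awaitTermination()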

Re: mllib model in production web API

2016-10-18 Thread Aseem Bansal
Hi Vincent I am not sure whether you are asking me or Nicolas. If me, then no we didn't. Never used Akka and wasn't even aware that it has such capabilities. Using Java API so we don't have Akka as a dependency right now. On Tue, Oct 18, 2016 at 12:47 PM, vincent gromakowski <

RE: Aggregate UDF (UDAF) in Python

2016-10-18 Thread Mendelson, Assaf
A simple example: We have a scala file: package com.myorg.example import org.apache.spark.sql.{Row, SparkSession} import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction} import org.apache.spark.sql.functions.{rand, sum} import
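To complement the (truncated) example above, a trimmed-down sketch of a Scala UDAF (a plain sum) that can be registered and then invoked from PySpark through SQL once the jar is on the classpath; the names are made up:

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
    import org.apache.spark.sql.types._

    class MySum extends UserDefinedAggregateFunction {
      def inputSchema: StructType = StructType(StructField("value", DoubleType) :: Nil)
      def bufferSchema: StructType = StructType(StructField("sum", DoubleType) :: Nil)
      def dataType: DataType = DoubleType
      def deterministic: Boolean = true

      def initialize(buffer: MutableAggregationBuffer): Unit = buffer(0) = 0.0
      def update(buffer: MutableAggregationBuffer, input: Row): Unit =
        if (!input.isNullAt(0)) buffer(0) = buffer.getDouble(0) + input.getDouble(0)
      def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit =
        buffer1(0) = buffer1.getDouble(0) + buffer2.getDouble(0)
      def evaluate(buffer: Row): Double = buffer.getDouble(0)
    }

    // spark.udf.register("my_sum", new MySum)  // then: SELECT my_sum(x) FROM t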

Re: Contributing to PySpark

2016-10-18 Thread Holden Karau
Hi Krishna, Thanks for your interest contributing to PySpark! I don't personally use either of those IDEs so I'll leave that part for someone else to answer - but in general you can find the building spark documentation at http://spark.apache.org/docs/latest/building-spark.html which includes

Re: About Error while reading large JSON file in Spark

2016-10-18 Thread Chetan Khatri
Dear Xi Shen, Thank you for getting back to my question. The approach I am following is as below: I have MS SQL Server as the enterprise data lake. 1. Run Java jobs and generate JSON files; every file is almost 6 GB. Correct, Spark needs every JSON object on a separate line, so I did: sed -e 's/}/}\n/g' -s

Re: jdbcRDD for data ingestion from RDBMS

2016-10-18 Thread Teng Qiu
Hi Ninad, I believe the purpose of jdbcRDD is to use an RDBMS as an additional data source during data processing; the main goal of Spark is still analyzing data from HDFS-like file systems. Using Spark as a data integration tool to transfer billions of records from an RDBMS to HDFS etc. could work, but

Re: About Error while reading large JSON file in Spark

2016-10-18 Thread Xi Shen
It is a plain Java IO error. Your line is too long. You should alter your JSON layout so each line is a small JSON object. Please do not concatenate all the objects into an array and then write the array on one line. You will have difficulty handling your super large JSON array in Spark anyway.
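For illustration, a hedged sketch of the layout Spark expects and a quick way to verify it; the path is a placeholder:

    // spark.read.json (sqlContext.read.json in 1.x) expects one JSON object per line:
    //   {"bid":"MzI4MTI5MzcyNw==","code":"..."}
    //   {"bid":"MzI3MzQ5Nzc2Nw==","code":"..."}
    // A single huge array written on one line runs into the 2^31-byte line limit.
    val df = spark.read.json("s3n://bucket/path/line-delimited.json")

    // Unparseable lines land in the _corrupt_record column by default,
    // which is a quick check that the layout is what Spark expects.
    df.printSchema()
    df.show(5, false)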

Contributing to PySpark

2016-10-18 Thread Krishna Kalyan
Hello, I am a masters student. Could someone please let me know how to set up my dev environment to contribute to PySpark. Questions I had were: a) Should I use IntelliJ IDEA or PyCharm? b) How do I test my changes? Regards, Krishna

Re: Help in generating unique Id in spark row

2016-10-18 Thread ayan guha
Do you have any primary key or unique identifier in your data? Even if multiple columns can make a composite key? In other words, can your data have exactly the same 2 rows with different unique IDs? Also, do you have to have a numeric ID? You may want to pursue a hashing algorithm such as the SHA group to
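Two of the options mentioned above, sketched on an assumed DataFrame `df` with placeholder column names: a numeric surrogate id, and a deterministic hash over a composite key:

    import org.apache.spark.sql.functions.{concat_ws, monotonically_increasing_id, sha2}

    // Unique (but not consecutive) numeric id per row.
    val withNumericId = df.withColumn("row_id", monotonically_increasing_id())

    // Deterministic id derived from a composite key (col1, col2 are placeholders).
    val withHashId = df.withColumn("row_id", sha2(concat_ws("||", df("col1"), df("col2")), 256))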

Re: Did anybody come across this random-forest issue with spark 2.0.1.

2016-10-18 Thread 市场部
Hi YanBo Thank you very much. You are totally correct! I just looked up spark document of 2.0.1. It says that "Maximum memory in MB allocated to histogram aggregation. If too small, then 1 node will be split per iteration, and its aggregates may exceed this size. (default = 256 MB)” Although
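For reference, a hedged sketch of raising that limit on the ML RandomForestClassifier; the other parameters are placeholders:

    import org.apache.spark.ml.classification.RandomForestClassifier

    val rf = new RandomForestClassifier()
      .setLabelCol("label")
      .setFeaturesCol("features")
      .setNumTrees(100)
      .setMaxDepth(10)
      .setMaxMemoryInMB(1024)   // default is 256 MB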

Re: Accessing Hbase tables through Spark, this seems to work

2016-10-18 Thread Mich Talebzadeh
The design really needs to look at other stack as well. If the visualisation layer is going to use Tableau then you cannot use Spark functional programming. Only Spark SQL or anything that works with SQL like Hive or Phoenix. Tableau is not a real time dashboard so for analytics it maps tables

Re: tutorial for access elements of dataframe columns and column values of a specific rows?

2016-10-18 Thread Divya Gehlot
Can you please elaborate on your use case? On 18 October 2016 at 15:48, muhammet pakyürek wrote: > -- > *From:* muhammet pakyürek > *Sent:* Monday, October 17, 2016 11:51 AM > *To:* user@spark.apache.org > *Subject:*

tutorial for access elements of dataframe columns and column values of a specific rows?

2016-10-18 Thread muhammet pakyürek
From: muhammet pakyürek Sent: Monday, October 17, 2016 11:51 AM To: user@spark.apache.org Subject: rdd and dataframe columns dtype how can I set the column dtypes of an RDD
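In case it helps, two hedged DataFrame patterns that may be what the question is after: casting a column to a different dtype, and reading column values out of specific rows. The DataFrame `df` and the column names are assumptions:

    import org.apache.spark.sql.functions.col

    // Change a column's dtype.
    val casted = df.withColumn("age", col("age").cast("double"))
    casted.printSchema()

    // Column values of rows matching a condition, materialised on the driver.
    val ages: Array[Double] = casted
      .filter(col("name") === "alice")
      .select("age")
      .collect()
      .map(_.getDouble(0))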

About Error while reading large JSON file in Spark

2016-10-18 Thread Chetan Khatri
Hello Community members, I am getting an error while reading a large JSON file in Spark. *Code:* val landingVisitor = sqlContext.read.json("s3n://hist-ngdp/lvisitor/lvisitor-01-aug.json") *Error:* 16/10/18 07:30:30 ERROR Executor: Exception in task 8.0 in stage 0.0 (TID 8) java.io.IOException: Too

Re: Accessing Hbase tables through Spark, this seems to work

2016-10-18 Thread Jörn Franke
Careful: HBase with Phoenix is only faster in certain scenarios, e.g. when processing small amounts out of a bigger amount of data (depending on node memory, the operation, etc.). Hive+Tez+ORC can be rather competitive; LLAP makes sense for interactive ad-hoc queries that are rather

Re: mllib model in production web API

2016-10-18 Thread vincent gromakowski
Hi, Did you try applying the model with Akka instead of Spark? https://spark-summit.org/eu-2015/events/real-time-anomaly-detection-with-spark-ml-and-akka/ On 18 Oct 2016, 5:58 AM, "Aseem Bansal" wrote: > @Nicolas > > No, ours is different. We required predictions within