SparkSQL with large result size

2016-05-01 Thread Buntu Dev
I have a 10g memory limit on the executors and am operating on a parquet dataset with a block size of 70M and 200 blocks. I keep hitting the memory limits when doing a 'select * from t1 order by c1 limit 100' (i.e., 1M). It works if I limit to, say, 100k. What are the options to save a large dataset without
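One common way around driver memory limits for a large sorted result is to write it back to distributed storage instead of collecting it. A minimal PySpark sketch, assuming a Spark 1.6-era `SQLContext` and an existing `sc`; the table and output path are placeholders, not from the thread:

```python
# Hypothetical sketch: write a large sorted result out in a distributed
# way rather than pulling it to the driver. Names/paths are made up.
from pyspark.sql import SQLContext  # Spark 1.6-era API

sqlContext = SQLContext(sc)  # `sc` is the already-running SparkContext
result = sqlContext.sql("select * from t1 order by c1")

# Each executor writes its own partitions; nothing is collected to the
# driver, so the 10g driver/executor result limit is not hit.
result.write.mode("overwrite").parquet("hdfs:///tmp/t1_sorted")
```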

using amazon STS with spark

2016-05-01 Thread Luke Rohde
Hi - I'm using S3 storage with Spark and would like to authenticate using AWS credentials provided by STS. I'm doing the following to use those credentials: val hadoopConf = sc.hadoopConfiguration; hadoopConf.set("fs.s3.awsAccessKeyId", credentials.getAccessKeyId)
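One caveat worth noting: the older `fs.s3.*` keys have no property for an STS session token, while the `s3a` connector (Hadoop 2.8+) does. A hedged sketch of the relevant settings as a plain dict; the values are placeholders, and the commented-out application step assumes a live SparkContext:

```python
# Hadoop S3A settings for STS temporary credentials (the legacy
# "fs.s3.*" keys cannot carry a session token). Values are placeholders.
sts_hadoop_conf = {
    "fs.s3a.aws.credentials.provider":
        "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider",
    "fs.s3a.access.key": "ASIA...",      # from credentials.getAccessKeyId
    "fs.s3a.secret.key": "<secret>",     # from credentials.getSecretAccessKey
    "fs.s3a.session.token": "<token>",   # from credentials.getSessionToken
}

# With a live context you would apply them roughly like:
#   for k, v in sts_hadoop_conf.items():
#       hadoopConf.set(k, v)   # Scala: sc.hadoopConfiguration.set(k, v)
```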

Re: Spark on AWS

2016-05-01 Thread Teng Qiu
Hi, here are several optimizations we made for accessing S3 from Spark: https://github.com/apache/spark/compare/branch-1.6...zalando:branch-1.6-zalando such as: https://github.com/apache/spark/compare/branch-1.6...zalando:branch-1.6-zalando#diff-d579db9a8f27e0bbef37720ab14ec3f6R133 you can deploy

Is DataFrame randomSplit Deterministic?

2016-05-01 Thread Brandon White
If I have the same data, the same ratios, and same sample seed, will I get the same splits every time?
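In short: yes, for a fixed seed the split is deterministic, provided the underlying data and its partitioning are unchanged (re-evaluation with a different partitioning can shuffle rows between splits). A plain-Python analogy of seeded splitting, not the Spark implementation itself:

```python
# Plain-Python analogy: the same data, ratio, and seed give the same
# split every time. DataFrame.randomSplit behaves the same way as long
# as the underlying partitioning of the data is unchanged.
import random

def seeded_split(rows, ratio, seed):
    rng = random.Random(seed)
    left, right = [], []
    for row in rows:
        (left if rng.random() < ratio else right).append(row)
    return left, right

data = list(range(20))
a1, b1 = seeded_split(data, 0.7, seed=42)
a2, b2 = seeded_split(data, 0.7, seed=42)
assert (a1, b1) == (a2, b2)  # identical splits for an identical seed
```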

Re: Why Non-resolvable parent POM for org.apache.spark:spark-parent_2.10:1.6.1:Couldnot transfer artifact org.apache:apache:pom:14 from/to central(https://repo1.maven.org/maven2):repo1.maven.org: unkn

2016-05-01 Thread Ted Yu
bq. Caused by: Compile failed via zinc server — looks like Zinc got in the way of compilation. Consider stopping Zinc and doing a clean build. On Sun, May 1, 2016 at 8:35 AM, sunday2000 <2314476...@qq.com> wrote: > Error message: > [debug] External API changes: API Changes: Set() > [debug] Modified
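A sketch of the suggested fix, run from the Spark source root; the exact zinc directory name depends on the version your checkout downloaded, hence the glob:

```shell
# Shut down the Zinc incremental-compile server, then do a clean build.
# The zinc path varies by Spark checkout; adjust as needed.
build/zinc-*/bin/zinc -shutdown   # or: pkill -f zinc
build/mvn -DskipTests clean package
```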

Re: Why Non-resolvable parent POM for org.apache.spark:spark-parent_2.10:1.6.1:Couldnot transfer artifact org.apache:apache:pom:14 from/to central(https://repo1.maven.org/maven2):repo1.maven.org: u

2016-05-01 Thread sunday2000
Error message: [debug] External API changes: API Changes: Set() [debug] Modified binary dependencies: Set() [debug] Initial directly invalidated sources: Set(/root/dl/spark-1.6.1/tags/src/main/java/org/apache/spark/tags/DockerTest.java,

Re: Why Non-resolvable parent POM for org.apache.spark:spark-parent_2.10:1.6.1:Couldnot transfer artifact org.apache:apache:pom:14 from/to central(https://repo1.maven.org/maven2):repo1.maven.org: u

2016-05-01 Thread sunday2000
Downloading: https://repository.apache.org/content/repositories/releases/org/apache/apache/14/apache-14.pom Downloading: https://repository.jboss.org/nexus/content/repositories/releases/org/apache/apache/14/apache-14.pom Downloading:

Re: Why Non-resolvable parent POM for org.apache.spark:spark-parent_2.10:1.6.1:Could not transfer artifact org.apache:apache:pom:14 from/to central(https://repo1.maven.org/maven2): repo1.maven.org: un

2016-05-01 Thread Ted Yu
FYI, accessing the link below gave me 'Page does not exist' (I am in California). I checked the dependency tree of 1.6.1 and didn't see such a dependency. Can you pastebin the related maven output? Thanks On Sun, May 1, 2016 at 6:32 AM, sunday2000 <2314476...@qq.com> wrote: > Seems it is because

Re: Error in spark-xml

2016-05-01 Thread Hyukjin Kwon
To be more clear: if you set the rowTag to "book", it produces an exception, which is the issue opened here: https://github.com/databricks/spark-xml/issues/92. Currently it does not support parsing a single element with only a value as a row. If you set the rowTag to "bkval", then it
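The rowTag idea can be illustrated without Spark at all: each element matched by rowTag becomes a row, and its child elements supply the columns, so an element holding only a bare text value has no children to map to fields. A plain-Python illustration (the XML and tag names are made up, echoing the thread):

```python
# Plain-Python analogy of spark-xml's rowTag: matched elements become
# rows, their child elements become columns. An element with only a
# text value has no child fields to map.
import xml.etree.ElementTree as ET

doc = "<books><book><bkval>42</bkval></book></books>"
root = ET.fromstring(doc)

# rowTag = "book": child elements supply the columns.
rows = [{child.tag: child.text for child in book} for book in root.iter("book")]
assert rows == [{"bkval": "42"}]

# rowTag = "bkval": the element carries only a text value, no children.
leaves = [list(el) for el in root.iter("bkval")]
assert leaves == [[]]  # nothing to turn into columns
```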

Re: Why Non-resolvable parent POM for org.apache.spark:spark-parent_2.10:1.6.1:Could not transfer artifact org.apache:apache:pom:14 from/to central(https://repo1.maven.org/maven2): repo1.maven.org:

2016-05-01 Thread sunday2000
Seems it is because it fails to download this url: http://maven.twttr.com/org/apache/apache/14/apache-14.pom ------ Original message ------ From: "Ted Yu"; Date: 2016-05-01 (Sun) 9:27; To: "sunday2000" <2314476...@qq.com>; Subject:

Re: Can not import KafkaProducer in spark streaming job

2016-05-01 Thread Ted Yu
According to examples/src/main/scala/org/apache/spark/examples/streaming/KafkaWordCount.scala: import org.apache.kafka.clients.producer.{KafkaProducer, ProducerConfig, ProducerRecord} Can you give the command line you used to submit the job? Probably a classpath issue. On Sun, May 1, 2016 at
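If it is a classpath issue, the usual fix is to put the Kafka integration on the submit classpath. A hedged sketch for a Spark 1.6 / Scala 2.10 build; the script name is a placeholder, and the `kafka-python` note applies only if the job imports `KafkaProducer` from that package:

```shell
# Submit a PySpark streaming job with the Kafka integration available
# (artifact coordinates for Spark 1.6.1 / Scala 2.10; script name is
# a placeholder).
spark-submit \
  --packages org.apache.spark:spark-streaming-kafka_2.10:1.6.1 \
  hdfs_to_kafka.py

# If the Python code itself does `from kafka import KafkaProducer`,
# the kafka-python client must also be installed on driver/executors:
pip install kafka-python
```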

Re: Why Non-resolvable parent POM for org.apache.spark:spark-parent_2.10:1.6.1: Could not transfer artifact org.apache:apache:pom:14 from/to central (https://repo1.maven.org/maven2): repo1.maven.org:

2016-05-01 Thread Ted Yu
bq. Non-resolvable parent POM for org.apache.spark:spark-parent_2.10:1.6.1 — looks like you were using Spark 1.6.1. Can you check your firewall settings? I have seen similar reports from users in China. Consider using a proxy. On Sun, May 1, 2016 at 4:19 AM, sunday2000 <2314476...@qq.com> wrote: > Hi, > We
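If the build host can only reach repo1.maven.org through a proxy, Maven reads its proxy settings from `~/.m2/settings.xml`. A hedged sketch; the host and port are placeholders:

```xml
<!-- ~/.m2/settings.xml — proxy host/port are placeholders -->
<settings>
  <proxies>
    <proxy>
      <id>corp-proxy</id>
      <active>true</active>
      <protocol>https</protocol>
      <host>proxy.example.com</host>
      <port>3128</port>
    </proxy>
  </proxies>
</settings>
```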

Why Non-resolvable parent POM for org.apache.spark:spark-parent_2.10:1.6.1: Could not transfer artifact org.apache:apache:pom:14 from/to central (https://repo1.maven.org/maven2): repo1.maven.org: unkn

2016-05-01 Thread sunday2000
Hi, We are compiling Spark 1.6.0 on a Linux server and are getting this error message. Could you tell us how to solve it? Thanks. [INFO] Scanning for projects... Downloading: https://repo1.maven.org/maven2/org/apache/apache/14/apache-14.pom Downloading:

Can not import KafkaProducer in spark streaming job

2016-05-01 Thread fanooos
I have a very strange problem. I wrote a Spark streaming job that monitors an HDFS directory, reads the newly added files, and sends the contents to Kafka. The job is written in Python and you can get the code from this link: http://pastebin.com/mpKkMkph When submitting the job I got this error

Spark 1.6.1 issue fetching data via JDBC in Spark-shell

2016-05-01 Thread Mich Talebzadeh
Hi, This sounds like a problem introduced in spark-shell 1.6.1. Objective: use a JDBC connection in the Spark shell to get data from an RDBMS table (in this case Oracle). Results: the JDBC connection is made OK but the collect fails with error ERROR TaskSetManager: Task 0 in stage 2.0 failed 4 times;
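A task failing in a stage (rather than at connection time) often means the JDBC driver jar is visible to the driver but not the executors. A hedged sketch of one common fix; the jar name and path are placeholders:

```shell
# Make the Oracle JDBC driver visible to both the driver and the
# executors (jar name/path are placeholders).
spark-shell --jars /path/to/ojdbc6.jar \
            --driver-class-path /path/to/ojdbc6.jar
```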

Re: Bit(N) on create Table with MSSQLServer

2016-05-01 Thread Mich Talebzadeh
Well, if MSSQL cannot create that column then it is more a compatibility issue between Spark and the RDBMS. What value does that column have in MSSQL? Can you create the table in the MSSQL database, or map it in Spark to a valid column type before opening the JDBC connection? HTH Dr Mich Talebzadeh LinkedIn *

Re: Error in spark-xml

2016-05-01 Thread Hyukjin Kwon
Hi Sourav, I think it is an issue. spark-xml assumes the element selected by rowTag is an object. Could you please open an issue at https://github.com/databricks/spark-xml/issues? Thanks! 2016-05-01 5:08 GMT+09:00 Sourav Mazumder: > Hi, > > Looks like there is a