Re: Read Data From NFS

2017-06-13 Thread Riccardo Ferrari
Hi Ayan, You might be interested in the official Spark docs: https://spark.apache.org/docs/latest/tuning.html#level-of-parallelism and their spark.default.parallelism setting. Best, On Mon, Jun 12, 2017 at 6:18 AM, ayan guha wrote: > I understand how it works with hdfs. My question is when hdfs is
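
A minimal sketch of the setting Riccardo points to, assuming a hypothetical application; only the config key comes from the linked docs. spark.default.parallelism controls how many partitions Spark uses for shuffle operations like join or reduceByKey when no explicit count is given:

    import org.apache.spark.sql.SparkSession

    // Hypothetical app name and value; 100 is just an example.
    val spark = SparkSession.builder()
      .appName("parallelism-demo")
      .config("spark.default.parallelism", "100")
      .getOrCreate()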

Re: Read Data From NFS

2017-06-13 Thread ayan guha
Hi, So, for example, if I set parallelism to 100, will 100 partitions be created? My question is how Spark divides the file. In other words, how does it decide that the first x lines will be read by the first partition, the next y lines by the second partition, and so on? In case of hdfs,
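
A minimal sketch of the mechanics being asked about, assuming a hypothetical file path: the second argument to textFile is a minimum partition count, the underlying Hadoop TextInputFormat splits the file into byte ranges, and each partition skips a partial first line and reads up to the first newline past the end of its range, so no line is split across partitions:

    // Hypothetical path; 100 is a *minimum* number of partitions.
    val rdd = sc.textFile("file:///data/big.txt", 100)
    println(rdd.getNumPartitions)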

Re: Parquet file generated by Spark, but not compatible read by Hive

2017-06-13 Thread Yong Zhang
The issue is caused by the data, and indeed a type mismatch between the Hive schema and Spark. Now it is fixed. Without that kind of data, the problem won't be triggered in some branches. Thanks for taking a look at this problem. Yong From: ayan guha Sent: Tuesday, J

Re: What is the real difference between Kafka streaming and Spark Streaming?

2017-06-13 Thread Paolo Patierno
I think that a big advantage of not using Spark Streaming when your solution is already based on Kafka is that you don't have to deal with another cluster. I mean ... imagine that your solution is already based on Kafka as the ingestion system for your events and then you need to do some real time an

Can I use ChannelTrafficShapingHandler to control the network read/write speed in shuffle?

2017-06-13 Thread Niu Zhaojie
Hi All: I am trying to control the network read/write speed with the ChannelTrafficShapingHandler provided by Netty. In TransportContext.java I modified it as below: public TransportChannelHandler initializePipeline( SocketChannel channel, RpcHandler channelRpcHandler) { try {
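
A minimal sketch of the handler in isolation, not the actual Spark patch, written in Scala for consistency with the other sketches here: ChannelTrafficShapingHandler takes write and read limits in bytes per second plus a check interval in milliseconds, and because it keeps per-channel counters a new instance is needed for each channel:

    import io.netty.channel.socket.SocketChannel
    import io.netty.handler.traffic.ChannelTrafficShapingHandler

    def addTrafficShaping(channel: SocketChannel): Unit = {
      val writeLimit = 10L * 1024 * 1024 // 10 MB/s, hypothetical limits
      val readLimit  = 10L * 1024 * 1024
      val checkIntervalMs = 1000L
      channel.pipeline().addFirst("trafficShaper",
        new ChannelTrafficShapingHandler(writeLimit, readLimit, checkIntervalMs))
    }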

Assign Custom receiver to a scheduler pool

2017-06-13 Thread Rabin Banerjee
Hi All, I have a Spark Streaming job which is running fine, and in that job I am using the FAIR scheduler with multiple pools: one pool for the main streaming job, and another for some parallel batch processing. All is working as expected, except .. "The custom receiver that I used is not assigned t
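
A minimal sketch of how pools are normally assigned, assuming pool names from a hypothetical fairscheduler.xml: the pool is a thread-local property, so it has to be set on the thread that submits each kind of work (whether the long-running receiver task picks it up is exactly the open question in this thread):

    // Jobs submitted from this thread go to the "streaming" pool.
    sc.setLocalProperty("spark.scheduler.pool", "streaming")
    // ... start the StreamingContext from this thread ...

    // Jobs submitted from another thread can target the "batch" pool.
    sc.setLocalProperty("spark.scheduler.pool", "batch")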

how to debug app with cluster mode please?

2017-06-13 Thread ??????????
Hi all, I am learning Spark 2.1 code. I wrote an app with master "local[4]"; I run and debug the code and it works well. When I change the code to master "local-cluster[2,2,1024]" and debug it as before, I get the following error: java.lang.ClassNotFoundException: com.xxx.xxx$$anonfun$main$1, where com.xxx.xxx is my main class. Would
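
A minimal sketch of the usual fix, assuming a hypothetical jar path: with a multi-JVM master such as local-cluster[2,2,1024], the executor JVMs do not share the IDE classpath, so the packaged application jar must be shipped explicitly or closure classes like com.xxx.xxx$$anonfun$main$1 cannot be loaded:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setMaster("local-cluster[2,2,1024]")
      .setAppName("debug-demo")
      .setJars(Seq("target/myapp.jar")) // hypothetical path to the built jar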

having trouble using structured streaming with file sink (parquet)

2017-06-13 Thread Mendelson, Assaf
Hi all, I have recently started assessing structured streaming and ran into a little snag from the beginning. Basically I wanted to read some data, do some basic aggregation and write the result to file: import org.apache.spark.sql.functions.avg import org.apache.spark.sql.streaming.Processing
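
A minimal sketch of a combination the file sink accepts, assuming a hypothetical socket source and column names: the parquet sink only supports append output mode, so a streaming aggregation needs a watermark before it can be written, and the sink also requires a checkpoint location:

    import org.apache.spark.sql.functions.{avg, window}
    import spark.implicits._

    val input = spark.readStream
      .format("socket").option("host", "localhost").option("port", "9999")
      .load()
      .selectExpr("CAST(value AS DOUBLE) AS v", "current_timestamp() AS ts")

    val query = input
      .withWatermark("ts", "10 minutes")
      .groupBy(window($"ts", "1 minute"))
      .agg(avg($"v"))
      .writeStream
      .format("parquet")
      .option("path", "/tmp/out")               // hypothetical output path
      .option("checkpointLocation", "/tmp/chk") // required by the file sink
      .outputMode("append")
      .start()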

Re: Scala, Python or Java for Spark programming

2017-06-13 Thread Irving Duran
I agree with most of the statements. I would add that there is more support in the Python community than the Scala one (nowadays), which will make it easier for a new programmer to learn. On Sat, Jun 10, 2017 at 6:43 PM vaquar khan wrote: > It depends on programming style; I would like to say set

SANSA 0.2 (Semantic Technologies on top of Spark) Released

2017-06-13 Thread Jens Lehmann
Dear all, The Smart Data Analytics group [1] is happy to announce SANSA 0.2 - the second release of the Scalable Semantic Analytics Stack. SANSA employs distributed computing for semantic technologies in order to allow scalable machine learning, inference and querying capabilities for large knowl

Java access to internal representation of DataTypes.DateType

2017-06-13 Thread Anton Kravchenko
How would one access the internal representation of DataTypes.DateType from the Spark (2.0.1) Java API? From https://github.com/apache/spark/blob/51b1c1551d3a7147403b9e821fcc7c8f57b4824c/sql/catalyst/src/main/scala/org/apache/spark/sql/types/DateType.scala : "Internally, this is represented as the numb
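
A minimal sketch of one way to see the same number without reaching into internals, assuming a hypothetical DataFrame with a DateType column "d": since DateType is stored as the number of days since the epoch, datediff against 1970-01-01 reproduces the internal value, and the same functions are available from the Java API:

    import org.apache.spark.sql.functions.{col, datediff, lit}

    // days_since_epoch equals DateType's internal Int representation.
    val withDays = df.withColumn("days_since_epoch",
      datediff(col("d"), lit("1970-01-01")))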

Read Local File

2017-06-13 Thread Dirceu Semighini Filho
Hi all, I'm trying to read a file from the local filesystem. I have 4 workstations, 1 master and 3 slaves, running with Ambari and YARN, with Spark version *2.1.1.2.6.1.0-129*. The code that I'm trying to run is quite simple: spark.sqlContext.read.text("file:///pathToFile").count I've copied the file in al
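
A minimal sketch of the constraint at play, assuming hypothetical paths: a file:// URL is resolved on whichever executor reads the split, so the file must exist at the identical path on every node; otherwise it should go to shared storage first:

    // Works only if /tmp/data.txt exists on the driver *and* all workers.
    val local = spark.read.text("file:///tmp/data.txt")

    // Otherwise copy it to HDFS (or any shared filesystem) and read from there.
    val shared = spark.read.text("hdfs:///tmp/data.txt")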

UDF percentile_approx

2017-06-13 Thread Andrés Ivaldi
Hello, I'm trying to use percentile_approx in my SQL query, but it seems the Spark context can't find the function. I'm using it like this: import org.apache.spark.sql.functions._ import org.apache.spark.sql.DataFrameStatFunctions val e = expr("percentile_approx(Cantidadcon0234514)") df.agg(e).show(
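
A minimal sketch of two working variants, assuming a hypothetical column "Cantidad": percentile_approx takes both a column and a percentage argument, and the same computation is also available without SQL parsing through DataFrameStatFunctions:

    import org.apache.spark.sql.functions.expr

    val viaExpr = df.agg(expr("percentile_approx(Cantidad, 0.5)"))

    // relativeError 0.01 trades accuracy for speed; 0.0 is exact.
    val viaStat = df.stat.approxQuantile("Cantidad", Array(0.5), 0.01)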

Spark Streaming Design Suggestion

2017-06-13 Thread Shashi Vishwakarma
Hi, I have to design a Spark Streaming application for the use case below, and I am looking for the best possible approach. I have an application which pushes data into 1000+ different topics, each with a different purpose. Spark Streaming will receive data from each topic, and after processing it will wr
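
A minimal sketch of one alternative to 1000+ independent streams, assuming hypothetical broker and topic names and using the Structured Streaming Kafka source rather than DStreams: a single query can subscribe to many topics with one pattern, and the topic column on each record identifies the origin downstream:

    import org.apache.spark.sql.functions.col

    val stream = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092") // hypothetical
      .option("subscribePattern", "app-.*")             // hypothetical naming
      .load()
      .select(col("topic"), col("value"))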

Re: Spark Streaming Design Suggestion

2017-06-13 Thread Jörn Franke
I do not fully understand the design here. Why not send everything to one topic with an application id in the message, and write to one output topic that also carries the application id? Can you elaborate a little bit more on the use case? Especially applications deleting/creating topics dynamically can b

Join pushdown on two external tables from the same external source?

2017-06-13 Thread drewrobb
I'm trying to figure out how to join multiple tables from a single external source directly in Spark SQL. Say I do the following in Spark SQL: CREATE OR REPLACE TEMPORARY VIEW t1 USING jdbc OPTIONS ( dbtable 't1' ...) CREATE OR REPLACE TEMPORARY VIEW t2 USING jdbc OPTIONS ( dbtable 't2' ...) SELECT *
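
A minimal sketch of the usual workaround, assuming hypothetical connection options: Spark does not push a join between two JDBC views down to the database, but the join can be pushed by hand by giving the database a subquery as the table:

    val joined = spark.read.format("jdbc")
      .option("url", "jdbc:postgresql://dbhost/db") // hypothetical
      .option("dbtable", "(SELECT * FROM t1 JOIN t2 ON t1.id = t2.id) AS q")
      .load()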

Re: Use SQL Script to Write Spark SQL Jobs

2017-06-13 Thread bo yang
Thanks Benjamin and Ayan for the feedback! You kind of represent the two groups of people who do or don't need such a script tool. Personally I find the script very useful for writing ETL pipelines and daily jobs. Let's see whether there are other people interested in such a project. Best, Bo On
