spark metadata metastore bug ?

2022-01-06 Thread Nicolas Paris
Spark can't see Hive schema updates, partly because it stores the schema in a weird way in the Hive metastore. 1. FROM SPARK: create a table
>>> spark.sql("select 1 col1, 2 col2").write.format("parquet").saveAsTable("my_table")
>>> spark.table("my_table").printSchema()
root |--
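For anyone hitting this, a quick way to see where the schema actually lives (a sketch in a pyspark shell): Spark serializes its own copy of the schema into the Hive table properties, which is why later Hive-side changes stay invisible to it. The property keys may differ across versions.

    spark.sql("select 1 col1, 2 col2").write.format("parquet").saveAsTable("my_table")
    # Spark keeps its own schema in the table properties, not only in the Hive columns:
    spark.sql("SHOW TBLPROPERTIES my_table").show(truncate=False)
    # look for the spark.sql.sources.schema.* entries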

Re: Choice of IDE for Spark

2021-10-01 Thread Nicolas Paris
> With IntelliJ you are OK with Spark & Scala. Also, IntelliJ has a nice Python plugin that turns it into PyCharm. On Thu Sep 30, 2021 at 1:57 PM CEST, Jeff Zhang wrote: > IIRC, you want an IDE for pyspark on yarn ? > > Mich Talebzadeh wrote on Thu, Sep 30, 2021 at 7:00 PM: > > > Hi, > > > > This may look

Re: AWS EMR SPARK 3.1.1 date issues

2021-08-29 Thread Nicolas Paris
As a workaround, turn off pruning: spark.sql.hive.metastorePartitionPruning false spark.sql.hive.convertMetastoreParquet false see https://github.com/awslabs/aws-glue-data-catalog-client-for-apache-hive-metastore/issues/45 On Tue Aug 24, 2021 at 9:18 AM CEST, Gourav Sengupta wrote: > Hi, > > I
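For reference, a sketch of passing those two settings at submit time (spark-defaults.conf or the session builder work as well; the script name is hypothetical):

    spark-submit \
      --conf spark.sql.hive.metastorePartitionPruning=false \
      --conf spark.sql.hive.convertMetastoreParquet=false \
      my_job.py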

Re: Why spark-submit works with package not with jar

2020-10-20 Thread Nicolas Paris
ss(ClassLoader.java:357) > > > Thanks > > > > *Disclaimer:* Use it at your own risk. Any and all responsibility for any > loss, damage or destruction of data or any other property which may arise > from relying on this email's technical content is explicitly disclaimed. > The au

Re: Why spark-submit works with package not with jar

2020-10-20 Thread Nicolas Paris
le and replace it with the same >>>> version but package it works! >>>> >>>> >>>> spark-submit --driver-class-path /home/hduser/jars/ddhybrid.jar --jars >>>> /home/hduser/jars/spark-bigquery-latest.jar,/home/hduser/jars/ddhybrid.jar >>>> --packages com.github.samelamin:spark-bigquery_2.11:0.2.6 >>>> >>>> >>>> I have read the write-ups about packages searching the maven >>>> libraries etc. Not convinced why using the package should make so much >>>> difference between a failure and success. In other words, when to use a >>>> package rather than a jar. >>>> >>>> >>>> Any ideas will be appreciated. >>>> >>>> >>>> Thanks >>>> >>>> >>>> >>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for >>>> any loss, damage or destruction of data or any other property which may >>>> arise from relying on this email's technical content is explicitly >>>> disclaimed. The author will in no case be liable for any monetary damages >>>> arising from such loss, damage or destruction. >>>> >>>> >>> -- > Best Regards, > Ayan Guha -- nicolas paris - To unsubscribe e-mail: user-unsubscr...@spark.apache.org
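In short, --packages resolves the coordinate plus its transitive dependencies from Maven/Ivy and puts them on both driver and executor classpaths, while --jars only ships the exact files listed, so any transitive dependency must be added by hand. A sketch reusing the coordinates from the thread (the application file is hypothetical):

    # --packages: resolves spark-bigquery and everything it depends on
    spark-submit --packages com.github.samelamin:spark-bigquery_2.11:0.2.6 app.py

    # --jars: only the listed jars are shipped; missing transitive deps surface as ClassNotFoundException
    spark-submit --jars /home/hduser/jars/spark-bigquery-latest.jar,/home/hduser/jars/ddhybrid.jar app.py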

Re: Count distinct and driver memory

2020-10-19 Thread Nicolas Paris
hich results in millions of tasks. > Perhaps the high memory usage is a side effect of caching the results of lots > of tasks. > > On 10/19/20, 1:27 PM, "Nicolas Paris" wrote: > > CAUTION: This email originated from outside of the organization. Do not >

Re: Count distinct and driver memory

2020-10-19 Thread Nicolas Paris
10 million > distinct values into the driver? Is countDistinct not recommended for data > frames with large number of distinct values? > > What’s the solution? Should I use approx_count_distinct? -- nicolas paris - To unsubscribe e-mail: user-unsubscr...@spark.apache.org
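A sketch of the approximate alternative (column name hypothetical): approx_count_distinct relies on HyperLogLog++ and keeps a small sketch per partition instead of shipping every distinct value around.

    from pyspark.sql import functions as F

    df.agg(F.countDistinct("user_id")).show()                    # exact, heavy
    df.agg(F.approx_count_distinct("user_id", rsd=0.01)).show()  # ~1% relative error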

Re: Time-based frequency table at scale

2020-03-11 Thread Nicolas Paris
to a dense > format > 4) GroupBy/aggregating ids by timestamp, converting to a sparse, > frequency-based vector using CountVectorizer, and then expanding to a dense > format > > Any other approaches we could try? > > Thanks! > Sakshi -- nicolas paris - To unsubscribe e-mail: user-unsubscr...@spark.apache.org

Re: SPARK Suitable IDE

2020-03-05 Thread Nicolas Paris
Holden Karau writes: > I work in emacs with ensime. The ensime project was stopped and the project archived. Its successor "metals" works well for scala >=2.12. Any good resource to setup ensime with emacs? Can't wait for the overall spark community to move to scala 2.12.

Re: Does dataframe spark API write/create a single file instead of directory as a result of write operation.

2020-02-22 Thread Nicolas PARIS
>>> There is no dataframe spark API which writes/creates a single file >>> instead of directory as a result of write operation. >>> >>> Below both options will create directory with a random file name. >>> >>>
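A common workaround, sketched below (not an official single-file API, and all paths are hypothetical): coalesce to one partition, write the directory, then rename the lone part file through the Hadoop FileSystem API.

    df.coalesce(1).write.mode("overwrite").csv("/tmp/out_dir", header=True)

    # rename the single part file; this goes through Spark's JVM gateway
    jvm = spark._jvm
    conf = spark._jsc.hadoopConfiguration()
    out = jvm.org.apache.hadoop.fs.Path("/tmp/out_dir")
    fs = out.getFileSystem(conf)
    part = [s.getPath() for s in fs.listStatus(out)
            if s.getPath().getName().startswith("part-")][0]
    fs.rename(part, jvm.org.apache.hadoop.fs.Path("/tmp/out.csv"))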

Re: Questions about count() performance with dataframes and parquet files

2020-02-18 Thread Nicolas PARIS
a memory-over-CPU trade-off. > > Enrico > > On 17.02.20 at 22:06, Nicolas PARIS wrote: >>> .dropDuplicates() \ .cache() | >>> Since df_actions is cached, you can count inserts and updates quickly >>> with only that one join in df_actions: >

Re: Questions about count() performance with dataframes and parquet files

2020-02-17 Thread Nicolas PARIS
va:0 => Exchange/MapPartitionsRDD >> [81]count at NativeMethodAccessorImpl.java:0 >> >> The other observation I have found that if I remove the counts from >> the data frame operations and instead open the outputted parquet >> field and count using a >> `sql_cont

Ceph / Lustre VS hdfs comparison

2020-02-12 Thread Nicolas PARIS
Hi Does anyone have experience with ceph / lustre as a replacement for hdfs for spark storage (parquet, orc..)? Is hdfs still far superior to them? Thanks -- nicolas paris - To unsubscribe e-mail: user-unsubscr

detect idle sparkcontext to release resources

2020-01-23 Thread Nicolas Paris
hi we have many users on the spark on yarn cluster. most of them forget to release their sparkcontext after analysis (spark-thrift or pyspark jupyter kernels). I wonder how to detect there is no activity on the sparkcontext in order to kill them. Thanks -- nicolas
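A partial mitigation, sketched below (standard settings, not a full idle-detector): with dynamic allocation an idle SparkContext at least hands its executors back to YARN, even if the driver process stays up.

    spark.shuffle.service.enabled true
    spark.dynamicAllocation.enabled true
    spark.dynamicAllocation.minExecutors 0
    spark.dynamicAllocation.executorIdleTimeout 60s
    spark.dynamicAllocation.cachedExecutorIdleTimeout 1h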

Best approach to write UDF

2020-01-21 Thread Nicolas Paris
Hi I have written spark UDFs and I am able to use them in spark scala / pyspark by using the org.apache.spark.sql.api.java.UDFx API. I'd like to use them in spark-sql through thrift. I tried to create the functions with "create function as 'org.my.MyUdf'". However I get the below error when using it:
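For context, CREATE FUNCTION in Spark SQL registers a Hive-style UDF, so the class has to extend Hive's UDF/GenericUDF rather than org.apache.spark.sql.api.java.UDF1. A sketch of the registration once such a class exists (class and jar names hypothetical):

    # the jar must be reachable by the thrift server / driver
    spark.sql("CREATE FUNCTION my_udf AS 'org.my.MyHiveUdf' USING JAR 'hdfs:///jars/my-udfs.jar'")
    spark.sql("SELECT my_udf(col1) FROM my_table").show()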

Re: Identify bottleneck

2019-12-20 Thread Nicolas Paris
apparently the "withColumn" issue only apply for hundred or thousand of calls. This was not the case here (twenty calls) On Fri, Dec 20, 2019 at 08:53:16AM +0100, Enrico Minack wrote: > The issue is explained in depth here: https://medium.com/@manuzhang/ >

Re: SparkR integration with Hive 3 spark-r

2019-11-18 Thread Nicolas Paris
Hi Alfredo my 2 cents: To my knowledge, reading the spark3 pre-release notes, it will handle hive metastore 2.3.5 - no mention of a hive 3 metastore. I made several tests on this in the past[1] and it seems to handle any hive metastore version. However spark cannot read hive managed table AKA

announce: spark-postgres 3 released

2019-11-10 Thread Nicolas Paris
Hello spark users, Spark-postgres is designed for reliable and performant ETL in big-data workloads and offers read/write/scd capability. Version 3 introduces a datasource API and simplifies the usage. It outperforms sqoop by a factor of 8 and the apache spark core jdbc by infinity. Features: -

Re: pyspark - memory leak leading to OOM after submitting 100 jobs?

2019-10-31 Thread Nicolas Paris
have you deactivated the spark.ui? I have read several threads explaining the UI can lead to OOM because it stores 1000 dags by default On Sun, Oct 20, 2019 at 03:18:20AM -0700, Paul Wais wrote: > Dear List, > > I've observed some sort of memory leak when using pyspark to run ~100 > jobs in
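A sketch of the settings involved if the UI turns out to be the culprit (by default it retains on the order of 1000 jobs and stages in driver memory):

    spark.ui.enabled false               # drop the UI entirely, or keep it but retain less state:
    spark.ui.retainedJobs 100
    spark.ui.retainedStages 100
    spark.sql.ui.retainedExecutions 50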

Re: graphx vs graphframes

2019-10-17 Thread Nicolas Paris
> extends the proposed 3.0 changes in a compatible way, are active. > > Yrs, > > Alastair > > > Alastair Green > > Query Languages Standards and Research > > > Neo4j UK Ltd > > Union House > 182-194 Union Street > London, SE1 0LH > >

graphx vs graphframes

2019-09-22 Thread Nicolas Paris
hi all graphframes was intended to replace graphx. however the former looks unmaintained while the latter is still active. any thoughts? -- nicolas - To unsubscribe e-mail: user-unsubscr...@spark.apache.org

Re: Call Oracle Sequence using Spark

2019-08-16 Thread Nicolas Paris
> I have to call Oracle sequence using spark. You might use jdbc and write your own lib in scala. I did such a thing for postgres (https://framagit.org/parisni/spark-etl/tree/master/spark-postgres); see sqlExecWithResultSet On Thu, Aug 15, 2019 at 10:58:11PM +0530, rajat kumar wrote: > Hi

Re: Announcing Delta Lake 0.3.0

2019-08-06 Thread Nicolas Paris
> • Scala/Java APIs for DML commands - You can now modify data in Delta Lake > tables using programmatic APIs for Delete, Update and Merge. These APIs > mirror the syntax and semantics of their corresponding SQL commands and > are > great for many workloads, e.g., Slowly Changing

Re: New Spark Datasource for Hive ACID tables

2019-07-27 Thread Nicolas Paris
Congrats The read/write feature with hive3 is highly interesting On Fri, Jul 26, 2019 at 06:07:55PM +0530, Abhishek Somani wrote: > Hi All, > > We at Qubole have open sourced a datasource that will enable users to work on > their Hive ACID Transactional Tables using Spark.  > > Github: 

Re: Avro large binary read memory problem

2019-07-23 Thread Nicolas Paris
> to delete this message and destroy any printed copies. >   > > -Original Message- > From: Nicolas Paris > Sent: Tuesday, July 23, 2019 6:56 PM > To: user@spark.apache.org > Subject: Avro large binary read memory problem > > Hi > > I have those avro file wit

Avro large binary read memory problem

2019-07-23 Thread Nicolas Paris
Hi I have those avro files with the schema id:Long, content:Binary. The binaries are large images with a maximum size of 2GB. I'd like to get a subset of rows "where id in (...)". Sadly I get memory errors even if the subset is of size 0. It looks like the reader stores the binary information
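One thing worth trying first, sketched below with hypothetical names: resolve the subset on the id column alone, since the Avro reader should be able to skip the binary column when it is not in the requested schema (worth verifying on your Spark/avro version).

    from pyspark.sql import functions as F

    ids = [42, 43]                                           # the wanted subset
    df = spark.read.format("avro").load("/data/images")      # schema: id long, content binary
    hits = df.select("id").where(F.col("id").isin(ids))      # the binary column is never requested
    hits.show()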

write csv does not handle \r correctly

2019-07-13 Thread Nicolas Paris
hi spark 2.4.1 The csv writer does not quote string columns when they contain the \r carriage return character. It works as expected for both \n and \r\n. \r is considered a newline by many parsers, and spark should treat it as a newline marker too. thanks -- nicolas
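Until this is fixed, a sketch of two workarounds (column name hypothetical): force quoting of every field, or normalize the bare \r before writing.

    from pyspark.sql import functions as F

    df.write.option("quoteAll", "true").csv("/tmp/out")          # quote every field

    df.withColumn("text", F.regexp_replace("text", "\r", "\n")) \
      .write.csv("/tmp/out2")                                    # or rewrite \r as \n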

timestamp column orc problem with hive

2019-07-13 Thread Nicolas Paris
Hi spark 2.4.1 hive 1.2.2 The orc files saved as tables from spark are not working correctly with hive. A "timestampCol is not null" does not work as expected. The parquet format works as expected for the same input. Is this a known issue ? thanks -- nicolas

Re: run new spark version on old spark cluster ?

2019-05-21 Thread Nicolas Paris
; most likely have to set something in spark-defaults.conf like > > spark.master yarn > spark.submit.deployMode client > > On Mon, May 20, 2019 at 3:14 PM Nicolas Paris > wrote: > > Finally that was easy to connect to both hive/hdfs. I just had to copy > th

Re: run new spark version on old spark cluster ?

2019-05-20 Thread Nicolas Paris
ported" installs, so you > could also look into that if you are not comfortable with running your own > spark build. > > On Mon, May 20, 2019 at 2:24 PM Nicolas Paris > wrote: > > > correct. note that you only need to install spark on the node you launch >

Re: run new spark version on old spark cluster ?

2019-05-20 Thread Nicolas Paris
d components between spark jobs on yarn are only really > spark-shuffle-service in yarn and spark-history-server. i have found > compatibility for these to be good. its best if these run latest version. > > On Mon, May 20, 2019 at 2:02 PM Nicolas Paris > wrote: > >

Re: run new spark version on old spark cluster ?

2019-05-20 Thread Nicolas Paris
ppily run multiple spark versions side-by-side > you will need the spark version you intend to launch with on the machine you > launch from and point to the correct spark-submit > > On Mon, May 20, 2019 at 1:50 PM Nicolas Paris > wrote: > > Hi > > I

run new spark version on old spark cluster ?

2019-05-20 Thread Nicolas Paris
Hi I am wondering whether it's feasible to: - build a spark application (with sbt/maven) based on spark2.4 - deploy that jar on yarn on a spark2.3 based installation thanks in advance, -- nicolas - To unsubscribe e-mail:

Re: log level in spark

2019-05-11 Thread Nicolas Paris
That's all right, I managed to reduce the log level by removing the logback dependency in the pom.xml On Sat, May 11, 2019 at 02:54:49PM +0200, Nicolas Paris wrote: > Hi > > I have a spark source code with tests that create sparkSessions. > > I am running spark testing fra

log level in spark

2019-05-11 Thread Nicolas Paris
Hi I have a spark source code with tests that create sparkSessions. I am running the spark testing framework. My concern is I am not able to configure the log level to INFO. I have large debug traces such as: > DEBUG org.spark_project.jetty.util.Jetty - > java.lang.NumberFormatException: For input
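For reference, the two knobs I usually try first (a sketch; the logback clash described in the follow-up is a separate problem):

    spark.sparkContext.setLogLevel("INFO")      # per-session override

    # or in a log4j.properties on the test classpath:
    # log4j.rootCategory=INFO, console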

Re: pySpark - pandas UDF and binaryType

2019-05-04 Thread Nicolas Paris
org/jira/browse/SPARK-23555. Also, pyarrow 0.10.0 or greater > is required as you saw in the docs. > > Bryan > > On Thu, May 2, 2019 at 4:26 AM Nicolas Paris > wrote: > > Hi all > > I am using pySpark 2.3.0 and pyArrow 0.10.0 > &

pySpark - pandas UDF and binaryType

2019-05-02 Thread Nicolas Paris
Hi all I am using pySpark 2.3.0 and pyArrow 0.10.0 I want to apply a pandas UDF on a dataframe but I get the below error: > Invalid returnType with grouped map Pandas UDFs: > StructType(List(StructField(filename,StringType,true),StructField(contents,BinaryType,true))) > is not supported
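For completeness, a sketch of the grouped-map UDF that works once BinaryType is supported (Spark 2.4+ with pyarrow >= 0.10, per SPARK-23555); the column names follow the error message, the body is a placeholder:

    import pandas as pd
    from pyspark.sql.functions import pandas_udf, PandasUDFType
    from pyspark.sql.types import StructType, StructField, StringType, BinaryType

    schema = StructType([StructField("filename", StringType()),
                         StructField("contents", BinaryType())])

    @pandas_udf(schema, PandasUDFType.GROUPED_MAP)
    def passthrough(pdf):
        return pdf              # replace with the real per-group transformation

    df.groupBy("filename").apply(passthrough)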

Re: [SQL] 64-bit hash function, and seeding

2019-03-05 Thread Nicolas Paris
Hi Huon Good catch. A 64 bit hash is definitely a useful function. > the birthday paradox implies >50% chance of at least one for tables larger > than 77000 rows Do you know how many rows give a 50% chance for a 64 bit hash? About the seed column, to me there is no need for such an
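A back-of-the-envelope answer, assuming a uniform 64-bit hash: roughly 5 billion rows for a ~50% chance of at least one collision.

    import math
    # birthday bound: n ~ sqrt(2 * N * ln 2) for a 50% collision probability
    n = math.sqrt(2 * 2**64 * math.log(2))    # ~ 5.1e9 rows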

Connect to hive 3 from spark

2019-03-04 Thread Nicolas Paris
Hi all Does anybody know if spark is able to connect to the hive metastore for hive 3 (metastore v3)? I know spark cannot deal with transactional tables, however I wonder if at least it can read/write non-transactional tables from hive 3. Thanks -- nicolas
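A sketch of how this is usually wired once the Spark build ships a 3.x-capable metastore client (Spark 3.x does; version and URI below are hypothetical):

    spark.sql.hive.metastore.version 3.1.2
    spark.sql.hive.metastore.jars maven              # or a path to the hive 3 client jars
    spark.hadoop.hive.metastore.uris thrift://metastore-host:9083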

Re: Postgres Read JDBC with COPY TO STDOUT

2018-12-31 Thread Nicolas Paris
, Nicolas Paris wrote: > Hi > > The spark postgres JDBC reader is limited because it relies on basic > SELECT statements with fetchsize and crashes on large tables even if > multiple partitions are setup with lower/upper bounds. > > I am about writing a new postgres JDBC reader bas

Postgres Read JDBC with COPY TO STDOUT

2018-12-29 Thread Nicolas Paris
Hi The spark postgres JDBC reader is limited because it relies on basic SELECT statements with fetchsize and crashes on large tables even if multiple partitions are setup with lower/upper bounds. I am about to write a new postgres JDBC reader based on "COPY TO STDOUT". It would stream the data
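A minimal sketch of the idea on the Python side, assuming psycopg2 is available on the executors and a numeric id column to split on (connection string, table and bounds are hypothetical): each partition streams its slice with COPY and Spark parses the resulting CSV lines.

    import io
    import psycopg2

    def copy_slice(bounds):
        lo, hi = bounds
        conn = psycopg2.connect("host=pg-host dbname=db user=u password=secret")
        buf = io.StringIO()
        conn.cursor().copy_expert(
            "COPY (SELECT * FROM big_table WHERE id >= %d AND id < %d) TO STDOUT WITH CSV"
            % (lo, hi), buf)
        conn.close()
        buf.seek(0)
        return [line.rstrip("\n") for line in buf]

    bounds = [(0, 1000000), (1000000, 2000000)]
    lines = spark.sparkContext.parallelize(bounds, len(bounds)).flatMap(copy_slice)
    df = spark.read.csv(lines)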

Re: jdbc spark streaming

2018-12-28 Thread Nicolas Paris
data into something like Kafka and then > use Spark streaming against that. > Taking that Kafka approach further, you can capture the delta upstream so > that the processing that pushes it into the RDBMS can also push it to Kafka > directly. > > On 12/27/18, 4:52 PM, "Nicolas Paris

jdbc spark streaming

2018-12-27 Thread Nicolas Paris
Hi I have this living RDBMS and I'd like to apply a spark job on several tables once new data gets in. I could run batch spark jobs through cron jobs every minute. But the job takes time and resources to start (sparkcontext, yarn). I wonder if I could run one instance of a spark streaming
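One middle ground is a single long-lived application that polls through JDBC, so the SparkContext/YARN startup cost is paid only once; a sketch with hypothetical URL, table and watermark column:

    import time

    last_id = 0
    while True:
        batch = (spark.read.format("jdbc")
                 .option("url", "jdbc:postgresql://host/db")
                 .option("dbtable", "(select * from events where id > %d) t" % last_id)
                 .option("user", "u").option("password", "secret")
                 .load())
        if batch.take(1):
            # ... apply the job on the new rows here ...
            last_id = batch.agg({"id": "max"}).first()[0]
        time.sleep(60)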

Re: streaming pdf

2018-11-19 Thread Nicolas Paris
source management (?) On Mon, Nov 19, 2018 at 07:23:10AM +0100, Jörn Franke wrote: > Why does it have to be a stream? > > > On 18.11.2018 at 23:29, Nicolas Paris wrote: > > > > Hi > > > > I have pdf to load into spark with at least > > format. I have con

streaming pdf

2018-11-18 Thread Nicolas Paris
Hi I have pdf to load into spark with at least format. I have considered some options: - spark streaming does not provide a native file stream for binary with variable size (binaryRecordStream specifies a constant size) and I would have to write my own receiver. - Structured streaming
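If a stream is not strictly required, later Spark versions (3.0+) ship a batch binaryFile source that handles variable-size binary files; a sketch with a hypothetical path:

    pdfs = (spark.read.format("binaryFile")
            .option("pathGlobFilter", "*.pdf")
            .load("/data/incoming_pdfs"))     # columns: path, modificationTime, length, content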

Re: How to avoid long-running jobs blocking short-running jobs

2018-11-03 Thread Nicolas Paris
On Sat, Nov 03, 2018 at 02:04:01AM -0700, conner wrote: > My solution is to find a good way to divide the spark cluster resource > into two. What about yarn and its queue management system ? -- nicolas - To unsubscribe

Re: Process Million Binary Files

2018-10-11 Thread Nicolas PARIS
Hi Joel I built such a pipeline to transform pdf -> text: https://github.com/EDS-APHP/SparkPdfExtractor You can take a look. It transforms 20M pdfs in 2 hours on a 5 node spark cluster On 2018-10-10 23:56, Joel D wrote: > Hi, > > I need to process millions of PDFs in hdfs using spark. First I’m

Re: csv reader performance with multiline option

2018-08-18 Thread Nicolas Paris
Hi yes, multiline would only use one thread in that case. The csv parser used by spark is univocity On Aug 18, 2018 at 18:07, Nirav Patel wrote: > does enabling 'multiLine' option impact performance? how? would it run read > entire file with just one thread? > > Thanks > > > > What's

Re: Best way to process this dataset

2018-06-19 Thread Nicolas Paris
Hi Raymond Spark works well on a single machine too, since it benefits from multiple cores. The csv parser is based on univocity and you might use the "spark.read.csv" syntax instead of using the rdd api; From my experience, this will be better than any other csv parser 2018-06-19 16:43 GMT+02:00

Re: Dataframe from 1.5G json (non JSONL)

2018-06-05 Thread Nicolas Paris
IMO your json cannot be read in parallel at all, so spark only offers you to play again with memory. I'd say at one step it has to fit in both one executor and the driver. I'd try something like 20GB for both driver and executors and use a dynamic amount of executors in order to then

Re: Dataframe from 1.5G json (non JSONL)

2018-06-05 Thread Nicolas Paris
have you played with driver/executor memory configuration ? Increasing them should avoid OOM 2018-06-05 22:30 GMT+02:00 raksja : > Agreed, gzip or non splittable, the question that i have and examples i > have > posted above all are referring to non compressed file. A single json file > with

Re: PySpark API on top of Apache Arrow

2018-05-26 Thread Nicolas Paris
hi corey not familiar with arrow, plasma. However I recently read an article about spark on a standalone machine (your case). Sounds like you could benefit from pyspark "as-is" https://databricks.com/blog/2018/05/03/benchmarking-apache-spark-on-a-single-node-machine.html regards, 2018-05-23

Re:

2018-05-16 Thread Nicolas Paris
Hi I would go for a regular mysql bulk load. I mean writing an output that mysql is able to load in one process. I'd say spark jdbc is ok for small fetch/load. When it comes to large RDBMS calls, it turns out using the regular optimized API is better than jdbc 2018-05-16 16:18 GMT+02:00 Vadim
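A sketch of what I mean (separator, path and table are hypothetical): dump a delimited file from Spark, then let MySQL's bulk loader ingest it in one pass.

    df.write.mode("overwrite").option("sep", "\t").csv("/tmp/export_for_mysql")

    # then, on the MySQL side:
    #   LOAD DATA INFILE '/tmp/export_for_mysql/<part-file>.csv'
    #     INTO TABLE my_table FIELDS TERMINATED BY '\t';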

Re: Best practices for dealing with large no of PDF files

2018-04-23 Thread Nicolas Paris
guys here is the illustration https://github.com/parisni/SparkPdfExtractor Please add issues if any questions or improvement ideas Enjoy Cheers 2018-04-23 20:42 GMT+02:00 unk1102 : > Thanks much Nicolas really appreciate it. > > > > -- > Sent from:

Re: Best practices for dealing with large no of PDF files

2018-04-23 Thread Nicolas Paris
sure then let me recap the steps: 1. load pdfs in a local folder to hdfs avro 2. load avro in spark as an RDD 3. apply pdfbox to each pdf and return the content as a string 4. write the result as a huge csv file (see the sketch below). That's some work guys for me to push all that. Should find some time however within 7 days
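A sketch of steps 2-4 on the Python side, using pdfminer.six as a stand-in for pdfbox (paths and column names hypothetical):

    import io
    from pdfminer.high_level import extract_text
    from pyspark.sql import functions as F
    from pyspark.sql.types import StringType

    pdf_to_text = F.udf(lambda content: extract_text(io.BytesIO(content)), StringType())

    pdfs = spark.read.format("avro").load("/data/pdf.avro")     # columns: path, content (binary)
    texts = pdfs.select("path", pdf_to_text("content").alias("text"))
    texts.write.mode("overwrite").csv("/data/pdf_text")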

Re: Best practices for dealing with large no of PDF files

2018-04-23 Thread Nicolas Paris
2018-04-23 18:59 GMT+02:00 unk1102 : > Hi Nicolas thanks much for the reply. Do you have any sample code > somewhere? I have some open-source code. I could find time to push it on github if needed. > Do you just keep pdf in avro binary all the time? yes, I store

Re: Best practices for dealing with large no of PDF files

2018-04-23 Thread Nicolas Paris
Hi The problem is the number of files on hadoop; I deal with 50M pdf files. What I did is to put them in an avro table on hdfs, as a binary column. Then I read it with spark and push that into pdfbox. Transforming 50M pdfs into text took 2 hours on a 5-node cluster About colors and formatting, I

Re: Accessing Hive Database (On Hadoop) using Spark

2018-04-15 Thread Nicolas Paris
Hi Sounds like your configuration files are not correctly set up. What does spark.sql("SHOW DATABASES").show(); output? If you only have the default database, the investigation there should help https://stackoverflow.com/questions/47257680/unable-to-get-existing-hive-tables-from-hivecontext-using-spark
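A quick check that Hive support is actually enabled and hive-site.xml is picked up (a sketch):

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("hive-check")
             .enableHiveSupport()      # needs hive-site.xml in $SPARK_HOME/conf or on the classpath
             .getOrCreate())
    spark.sql("SHOW DATABASES").show()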

Re: Does Pyspark Support Graphx?

2018-02-18 Thread Nicolas Paris
> Most likely not as most of the effort is currently on GraphFrames  - a great > blog post on what GraphFrames offers can be found at: https:// Is the graphframes package still active ? The github repository indicates it's not extremely active. Right now, there is no available package for

Re: ML:One vs Rest with crossValidator for multinomial in logistic regression

2018-02-09 Thread Nicolas Paris
> > Bryan > > On Wed, Jan 31, 2018 at 10:20 PM, Nicolas Paris <nipari...@gmail.com> wrote: > > Hey > > I am also interested in how to get those parameters. > For example, the demo code spark-2.2.1-bin-hadoop2.7/examples/src/main/ > python/ml/est

Re: ML:One vs Rest with crossValidator for multinomial in logistic regression

2018-01-31 Thread Nicolas Paris
Hey I am also interested in how to get those parameters. For example, the demo code spark-2.2.1-bin-hadoop2.7/examples/src/main/python/ml/estimator_transformer_param_example.py returns empty parameters when printing "lr.extractParamMap()" That's weird Thanks On Jan 30, 2018 at 23:10, Bryan

Re: Vectorized ORC Reader in Apache Spark 2.3 with Apache ORC 1.4.1.

2018-01-28 Thread Nicolas Paris
Hi Thanks for this work. Will this affect both: 1) spark.read.format("orc").load("...") 2) spark.sql("select ... from my_orc_table_in_hive") ? On Jan 10, 2018 at 20:14, Dongjoon Hyun wrote: > Hi, All. > > Vectorized ORC Reader is now supported in Apache Spark 2.3. > >    
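For reference, a sketch of the settings that govern the two paths (defaults differ across versions): the native reader covers direct ORC loads, and convertMetastoreOrc lets Spark use it for Hive ORC tables as well.

    spark.sql.orc.impl native
    spark.sql.hive.convertMetastoreOrc true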

Re: Hive From Spark: Jdbc VS sparkContext

2017-11-22 Thread Nicolas Paris
layer (hive/kerberos) - direct partitioned lazy datasets versus complicated jdbc dataset management - more robust for analytics with less memory (apparently) However presto still makes sense for sub-second analytics, OLTP-like queries and data discovery. On Nov 5, 2017 at 13:57, Nicolas Paris

Re: pySpark driver memory limit

2017-11-08 Thread Nicolas Paris
On Nov 6, 2017 at 19:56, Nicolas Paris wrote: > Can anyone clarify the driver memory aspects of pySpark? > According to [1], spark.driver.memory limits JVM + python memory. > > In case: > spark.driver.memory=2G > Then does it mean the user won't be able to use more

pySpark driver memory limit

2017-11-06 Thread Nicolas Paris
hi there Can anyone clarify the driver memory aspects of pySpark? According to [1], spark.driver.memory limits JVM + python memory. In case spark.driver.memory=2G, does it mean the user won't be able to use more than 2G, whatever the python code + the RDD stuff he is using? Thanks, [1]:

Re: Hive From Spark: Jdbc VS sparkContext

2017-11-05 Thread Nicolas Paris
On Nov 5, 2017 at 22:46, ayan guha wrote: > Thank you for the clarification. That was my understanding too. However how to > provide the upper bound as it changes for every call in real life. For example > it is not required for sqoop.  True. AFAIK sqoop begins with doing a "select

Re: Hive From Spark: Jdbc VS sparkContext

2017-11-05 Thread Nicolas Paris
On Nov 5, 2017 at 22:02, ayan guha wrote: > Can you confirm if JDBC DF Reader actually loads all data from source to > driver > memory and then distributes to the executors? apparently yes when not using a partition column > And this is true even when a > partition column is provided? No,

Re: Hive From Spark: Jdbc VS sparkContext

2017-11-05 Thread Nicolas Paris
s, > Gourav Sengupta > > > On Sun, Nov 5, 2017 at 12:57 PM, Nicolas Paris <nipari...@gmail.com> wrote: > > Hi > > After some testing, I have been quite disapointed with hiveContext way of > accessing hive tables. > > The main problem is resourc

Re: Hive From Spark: Jdbc VS sparkContext

2017-11-05 Thread Nicolas Paris
s, > Gourav > > On Sun, Oct 15, 2017 at 3:43 PM, Nicolas Paris <nipari...@gmail.com> wrote: > > > I do not think that SPARK will automatically determine the partitions. > Actually > > it does not automatically determine the partitions. In case a

Re: Hive From Spark: Jdbc VS sparkContext

2017-10-15 Thread Nicolas Paris
> I do not think that SPARK will automatically determine the partitions. > Actually > it does not automatically determine the partitions. In case a table has a few > million records, it all goes through the driver. Hi Gourav Actually the spark jdbc driver is able to deal directly with partitions.
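A sketch of the partitioned JDBC read being discussed (connection details and bounds hypothetical): Spark issues one bounded query per partition instead of a single SELECT funneled through the driver.

    df = (spark.read.format("jdbc")
          .option("url", "jdbc:postgresql://host/db")
          .option("dbtable", "big_table")
          .option("partitionColumn", "id")
          .option("lowerBound", "1")
          .option("upperBound", "10000000")
          .option("numPartitions", "16")
          .option("fetchsize", "10000")
          .load())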

Re: Hive From Spark: Jdbc VS sparkContext

2017-10-15 Thread Nicolas Paris
e entire data? This works for static datasets; when new data is coming in by batch processes, the spark application should be reloaded to get the new files in the folder >> On Sun, Oct 15, 2017 at 12:55 PM, Nicolas Paris <nipari...@gmail.com> wrote: > > Le 03 oct. 2017 à 20:0

Re: Hive From Spark: Jdbc VS sparkContext

2017-10-15 Thread Nicolas Paris
On Oct 3, 2017 at 20:08, Nicolas Paris wrote: > I wonder the differences accessing HIVE tables in two different ways: > - with jdbc access > - with sparkContext Well there is also a third way to access the hive data from spark: - with direct file access (here ORC format) For exam

Re: Hive From Spark: Jdbc VS sparkContext

2017-10-13 Thread Nicolas Paris
> In case a table has a few > million records, it all goes through the driver. This sounds clear: in JDBC mode, the driver gets all the rows and then it spreads the RDD over the executors. I'd say that most use cases deal with SQL to aggregate huge datasets, and retrieve a small amount of rows to be

Hive From Spark: Jdbc VS sparkContext

2017-10-03 Thread Nicolas Paris
Hi I wonder about the differences between accessing HIVE tables in two different ways: - with jdbc access - with sparkContext I would say that jdbc is better since it uses HIVE which is based on map-reduce / TEZ and then works on disk. Using spark rdd can lead to memory errors on very huge datasets. Anybody

Working with hadoop har file in spark

2017-08-17 Thread Nicolas Paris
Hi I put a million files into a har archive on hdfs. I'd like to iterate over their file paths, and read them. (Basically they are pdf, and I want to transform them into text with apache pdfbox) My first attempt has been to list them with the hadoop command `hdfs dfs -ls har:///user//har/pdf.har`
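A sketch of what I would try, assuming the HAR filesystem is visible to Spark's Hadoop configuration (path hypothetical): binaryFiles goes through the Hadoop FileSystem API, so a har:// URI should enumerate the archived files without unpacking the archive.

    rdd = spark.sparkContext.binaryFiles("har:///user/someone/har/pdf.har")
    sizes = rdd.map(lambda kv: (kv[0], len(kv[1])))     # (path, byte size) as a smoke test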